If you build an application that uses large, contiguous amounts of memory, allocating that memory in so-called huge pages can improve its performance. Linux offers two ways of doing that - a legacy way and a modern way. This article describes the modern way of using huge pages, so-called transparent huge pages (THPs), and applies the techniques from a previous article to verify that we actually got huge pages.
The article starts by giving a super-short recap on how paging works and why huge pages are beneficial. If you are already familiar with this, you can skip straight ahead to the two ways of allocating huge pages under Linux.
I’ll keep this super-short. For a slightly longer version have a look at my previous post.
The memory your program sees (a.k.a. virtual memory) is divided into blocks of usually 4 kB, the so-called pages. To actually use any of the memory space your program sees, the page it is located in must be associated with a frame, which is a 4 kB chunk of physical memory. This association is recorded in the page table, which is a lookup data structure maintained by the operating system but also usually used directly by the CPU. Thus, the exact structure of the page table depends on the CPU architecture.
With this in place, every memory access requires a lookup (performed directly in the Memory Management Unit (MMU) in the CPU) from virtual to physical address. The page table can become large and is usually a multi-layered data structure. Performing a lookup in this large table for every single memory access would be prohibitively slow. Therefore, the MMU keeps a Translation Lookaside Buffer (TLB), which is essentially a cache of recently-used entries from the page tables. Modern CPUs usually have a multi-level TLB (similar to data caches), so one can’t simply state a size of “the TLB”. As an example: the top-level data TLB in a Skylake CPU has 64 entries.1 Thus, memory from the 64 last-accessed pages is readily available; all other memory accesses will either fall back to a lower-level TLB cache, or in the worst case have the MMU traverse the large page table structure.
To avoid this as much as possible, it is beneficial to touch as few pages as possible in quick succession. There are many ways of optimizing your application for “locality”; one of them is using huge pages. Huge pages are just like normal pages, but larger. Skylake, for example, supports huge pages of 2 MB, 4 MB and 1 GB sizes. If you have 1 GB of contiguous data, and you manage to pack it into a single 1 GB page, you can access all of the data with a single page table lookup - compared to the 262144 “usual” 4 kB pages that would be necessary for 1 GB of data.
If you are reading this, you are probably thinking about building software that allocates memory in huge pages. The paragraph above reads like huge pages are always a good idea - however, that is not necessarily the case. The main factor here is that applications may allocate a lot of memory, but use only very little of it. Usually, the Linux kernel does not immediately map a frame to a page when memory is allocated - the mapping (and thus the usage of actual physical memory) only happens once the memory is accessed.
Imagine allocating 2 MB (which would nicely fit a 2 MB huge page) and then only ever using the first byte of that block. When using ordinary pages, only the first page will be mapped, so only 4 kB of physical memory is used. However, if the allocation is done using a huge page, the full 2 MB of virtual memory needs to be mapped to 2 MB of physical memory, thus wasting a lot of physical memory.
Linux has two very different ways of getting your memory allocated in huge pages. One way is called transparent huge pages (THP), and the other way … does not really have a name. They are often called HugeTLB huge pages, or explicit huge pages.
Transparent huge pages are more or less completely handled by the Linux kernel. Depending on your current kernel configuration (see the next section) it can happen that memory in your application is allocated in huge pages without you - as the programmer - even knowing. It may even happen that memory for your application that was not originally allocated in a huge page is later transformed into a huge page!
With HugeTLB huge pages however, you - as the programmer - need to explicitly state that you want a certain memory allocation to be allocated in a huge page. This approach nonetheless needs some kernel configuration, since the kernel needs to actually reserve these huge pages so that you can allocate them. If no more huge pages are available and your application requests an allocation in a huge page, the allocation will fail in this scenario. This approach appears to be deprecated. Using HugeTLB pages involves some filesystem you need to mount.2
Using explicit huge pages is considerably harder than, and provides little benefit over, using THPs, so I’m not going to talk any more about explicit huge pages. There is one interesting benefit of explicit huge pages which I don’t want to keep from you: you can huge-page-ify your application without any code changes by just linking it against libhugetlbfs. If you are interested in that, I recommend just installing libhugetlbfs and having a look at `man libhugetlbfs`.
Everything in the remainder of this article relates to THPs.
As already mentioned, THPs are managed by the kernel, to be precise by the `khugepaged` kernel thread. The `khugepaged` kernel thread runs in the background and continually tries to select consecutive “ordinary” (i.e., 4 kB) pages that can be combined into a single huge page. The size of the huge pages used for THPs is architecture-dependent and cannot be changed.3 For the kernel to actually run `khugepaged`, the `TRANSPARENT_HUGEPAGE` kernel option must be activated.4

There are several kernel runtime parameters that control `khugepaged` (see this great guide for details), but the most basic one you will need is set at:

```
/sys/kernel/mm/transparent_hugepage/enabled
```
This setting has three possible values: `always`, `never` and `madvise`. You can see what your system is running by just running

```
> cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
```

On my system (and this is probably the default on many systems), the `madvise` setting is active. The settings roughly mean:

- `never`: `khugepaged` is disabled (and will probably not even run). No THPs will be used.
- `always`: `khugepaged` is running and will try to find consecutive pages to collapse into a THP. It considers essentially all pages allocated by all user applications. User applications do not need to do anything to have their memory allocations transferred into huge pages.
- `madvise`: `khugepaged` will only consider pages mapped with the `MADV_HUGEPAGE` flag. Applications must explicitly set this flag on memory they allocate and can thus control which allocations should be eligible for huge pages and which should not.

You can change the setting by e.g. running (as `root`):

```
> echo "always" >! /sys/kernel/mm/transparent_hugepage/enabled
```
Note that `khugepaged` can usually only consolidate ordinary 4 kB pages into a huge page if the start of the region to be mapped into a huge page aligns to a huge page boundary. So, if you want to use 2 MB huge pages, your memory allocation should be aligned to a multiple of 2 MB.
So, allocating huge pages via THPs is easy. You only need to make sure that THPs are enabled, allocate your memory preferably aligned to a huge page boundary, and call `madvise(…)` on your allocated memory if necessary.
Consider this very simple example, where `is_huge()` and `is_thp()` inspect the page table to figure out whether the passed address is allocated in a huge page (resp. a transparent huge page as opposed to a HugeTLB huge page). An explanation of how this works and the implementation of these functions can be found in my previous article on inspecting the Linux page table.
```cpp
#include <cstdlib>  // for aligned_alloc()
#include <iostream>
#include <sys/mman.h>

// ... definition of is_huge() and is_thp() ...

constexpr size_t HPAGE_SIZE = 2 * 1024 * 1024;

int main() {
  auto size = 4 * HPAGE_SIZE;
  void *mem = aligned_alloc(HPAGE_SIZE, size);

  madvise(mem, size, MADV_HUGEPAGE);

  // Make sure the page is present
  static_cast<char *>(mem)[0] = 'x';

  std::cout << "Is huge? " << is_huge(mem) << "\n";
  std::cout << "Is THP? " << is_thp(mem) << "\n";
}
```
You can also download this as a fully self-contained example.
This code assumes that we have 2 MB huge pages (which is the size of THP pages for the x86 and x86-64 architectures)5, and then allocates four huge pages (i.e., 8 MB) of memory aligned to the huge page boundary. Right after the allocation, `madvise(…)` is called on the freshly allocated memory to tell `khugepaged` that we want this to be allocated in huge pages. Right after that, we actually access the memory (writing the arbitrary byte ‘x’) so that the kernel actually has to map the page to a frame.
If you downloaded thp.cpp from above, you should be able to reproduce this example like this:

```
> g++ --std=c++17 thp.cpp -o thp
> sudo ./thp
Is huge? 1
Is THP? 1
```

This is what you should see if your `khugepaged` setting is either `always` or `madvise`. Try setting that setting to `never`, or setting it to `madvise` and removing the `madvise(…)` call from the code. This should change the output to `Is huge? 0`.
Allocating all your memory aligned to a 2 MB boundary may be cumbersome, and if you intend to compile your code for different architectures, this may even mean that you need to align depending on the current architecture. Also, if you want to use background defragmentation to have your application use THPs without any code changes (i.e., set `/sys/kernel/mm/transparent_hugepage/enabled` to `always`), your allocations will probably not be aligned.
Does this mean you cannot use THPs? No, you are probably fine. It is true that `khugepaged` can only consolidate ordinary pages into a huge page starting on a huge-page-aligned boundary, i.e., a memory address that is a multiple of the huge page size. But if we assume that your allocations are “large”6, this is not a very strict requirement. If your (unaligned) allocation spans multiple huge pages, the “middle” of your allocation will actually fit neatly into huge-page-aligned memory, and only the beginning and end of your allocation will end up in ordinary pages.
Let’s verify this in code. I have modified the example from the previous section to explicitly allocate memory that is not aligned to a huge page boundary:
```cpp
// Return <size> bytes of allocated memory guaranteed *not* to be aligned to a THP boundary.
void *allocate_unaligned(size_t size);

int main() {
  auto size = 4 * HPAGE_SIZE;

  void *mem = allocate_unaligned(size);

  madvise(mem, size, MADV_HUGEPAGE);

  // Make sure all pages are present
  memset(mem, 'x', size);

  std::cout << "Start of memory: Is huge? " << is_huge(mem) << "\n";
  std::cout << "Start of memory: Is THP? " << is_thp(mem) << "\n";

  // This gives us a pointer to the next (starting from <mem>) address that is
  // aligned to HPAGE_SIZE.
  void *nextHPageAlignedPtr =
      reinterpret_cast<void *>(reinterpret_cast<size_t>(mem) + HPAGE_SIZE -
                               reinterpret_cast<size_t>(mem) % HPAGE_SIZE);

  std::cout << "On the next THP aligned address: Is huge? " << is_huge(nextHPageAlignedPtr) << "\n";
  std::cout << "On the next THP aligned address: Is THP? " << is_thp(nextHPageAlignedPtr) << "\n";
}
```

`allocate_unaligned()` implementation:

```cpp
// PAGE_SIZE is the ordinary page size (4096 on x86-64), defined in the full example.
void *allocate_unaligned(size_t size) {
  void *mem = mmap(0, size + PAGE_SIZE, (PROT_READ | PROT_WRITE),
                   (MAP_PRIVATE | MAP_ANONYMOUS), -1, 0);
  assert(mem != MAP_FAILED);
  if ((reinterpret_cast<size_t>(mem) % HPAGE_SIZE) == 0) {
    // We randomly got huge-page-aligned memory. Unmap and map again with a
    // shift of one (non-huge!) page.
    std::cout << "First attempt yielded aligned memory. Remapping.\n";
    void *target_addr =
        reinterpret_cast<void *>(reinterpret_cast<size_t>(mem) + PAGE_SIZE);
    munmap(mem, size + PAGE_SIZE);
    mem = mmap(target_addr, size, (PROT_READ | PROT_WRITE),
               (MAP_PRIVATE | MAP_ANONYMOUS), -1, 0);
    assert(mem != MAP_FAILED);
  }

  return mem;
}
```

Again, you can download this as a fully self-contained example that you should be able to build using `g++ --std=c++17 thp_unaligned.cpp -o thp_unaligned`. What does the code do? The magic `allocate_unaligned()` function call (expand the drawer above for the implementation) allocates 8 MB of memory not aligned to a THP boundary. We then advise the kernel that we want THP pages via the `madvise(…)` call. In the computation of `nextHPageAlignedPtr`, we build a special pointer into our newly allocated memory: we basically “round up” the memory address `mem` to the next multiple of `HPAGE_SIZE`, giving us a pointer into the middle of the allocation, i.e., the block of memory that we expect to be allocated in huge pages.
And indeed, when executing this, I get on my machine:

```
Start of memory: Is huge? 0
Start of memory: Is THP? 0
On the next THP aligned address: Is huge? 1
On the next THP aligned address: Is THP? 1
```

So we see that `khugepaged` is rather forgiving with unaligned memory: if your allocations are large6, most of your memory allocation will end up in a huge page.
This concludes my little tour around transparent huge pages. THPs are surprisingly simple to use - a simple `madvise` call is all that is necessary in the default case. Adventurous system engineers might even just set `…/transparent_hugepage/enabled` to `always` and enjoy huge pages even in applications they have no control over. I’m currently thinking about evaluating that for machines hosting some memory-intensive tasks. If you have any experience using the `always` setting, I’d love to hear how that went for you.
The benefits of which are unclear. I tried asking around what the advantage of the filesystem approach is, but got no real answers. If you want to weigh in on this, feel free to leave a comment or reply to my SE question. ↩︎
This is hardcoded as a macro in the Linux kernel, where the exact size (i.e., `PMD_SHIFT`) is architecture-dependent. For x86, it is defined to be 21, which gives us 2^21 bytes = 2 * 1024 * 1024 bytes = 2 MB. ↩︎
This should be the case in basically all Linux kernels shipped with major distributions. ↩︎
Have a look at `/sys/kernel/mm/transparent_hugepage/hpage_pmd_size` to verify the THP size on your system. ↩︎
Note that even if you allocate seemingly small pieces of memory, e.g. via `malloc`, these will usually be part of a larger memory allocation. Functions like `malloc`, `new` and friends do not directly allocate memory; they are not syscalls. They are calls into your C runtime library. The memory allocation implementation of that C library will usually try to make large allocations on an OS level (because these are expensive), and serve smaller user allocations (i.e., calls to `malloc` and friends) from these larger allocated areas. ↩︎ ↩︎
You can use your Mastodon account to reply to this post.