If you build an application that uses large, contiguous amounts of memory, allocating that memory in so-called huge pages can improve its performance. Linux offers two ways of doing that - a legacy way and a modern way. This article describes the modern way of using huge pages, so-called transparent huge pages (THPs), and applies the techniques from a previous article to verify that we actually got huge pages.
The article starts by giving a super-short recap on how paging works and why huge pages are beneficial. If you are already familiar with this, you can skip straight ahead to the two ways of allocating huge pages under Linux.
I’ll keep this super-short. For a slightly longer version have a look at my previous post.
The memory your program sees (a.k.a. virtual memory) is divided into blocks of usually 4 kB, the so-called pages. To actually use any of the memory space your program sees, the page it is located in must be associated with a frame, which is a 4 kB chunk of physical memory. This association is recorded in the page table, which is a lookup data structure maintained by the operating system but also usually used directly by the CPU. Thus, the exact structure of the page table depends on the CPU architecture.
With this in place, every memory access requires a lookup (performed directly in the Memory Management Unit (MMU) in the CPU) from virtual to physical address. The page table can become large and is usually a multi-layered data structure. Performing a lookup in this large table for every single memory access would be prohibitively slow. Therefore, the MMU keeps a Translation Lookaside Buffer (TLB), which is essentially a cache of recently-used entries from the page tables. Modern CPUs usually have a multi-level TLB (similar to data caches), so one can’t simply state a size of “the TLB”. As an example: the top-level data TLB in a Skylake CPU has 64 entries.1 Thus, memory from the 64 last-accessed pages is readily available; all other memory accesses will either fall back to a lower-level TLB cache, or in the worst case have the MMU traverse the large page table structure.
To avoid this as much as possible, it is beneficial to touch as few pages as possible in quick succession. There are many ways of optimizing your application for “locality”; one of them is using huge pages. Huge pages are just like normal pages, but larger. Skylake, for example, supports huge pages of 2 MB, 4 MB and 1 GB sizes. If you have 1 GB of contiguous data, and you manage to pack it into a single 1 GB page, you can access all of the data with a single page table lookup - compared to the 262144 “usual” 4 kB pages that would be necessary for 1 GB of data.
If you are reading this, you are probably thinking about building software that allocates memory in huge pages. The paragraph above reads like huge pages are always a good idea - however, that is not necessarily the case. The main factor here is that applications may allocate a lot of memory, but use only very little of it. Usually, the Linux kernel does not immediately map a frame to a page when memory is allocated - the mapping (and thus the usage of actual physical memory) only happens once the memory is accessed.
Imagine allocating 2 MB (which would nicely fit a 2 MB huge page) and then only ever using the first byte of that block. When using ordinary pages, only the first page will be mapped, so only 4 kB of physical memory is used. However, if the allocation is done using a huge page, the full 2 MB of virtual memory needs to be mapped to 2 MB of physical memory, thus wasting a lot of physical memory.
Linux has two very different ways of getting your memory allocated in huge pages. One way is called transparent huge pages (THP), and the other way … does not really have a name. They are often called HugeTLB huge pages, or explicit huge pages.
Transparent huge pages are more or less completely handled by the Linux kernel. Depending on your current kernel configuration (see the next section) it can happen that memory in your application is allocated in huge pages without you - as the programmer - even knowing. It may even happen that memory for your application that was not originally allocated in a huge page is later transformed into a huge page!
With HugeTLB huge pages however, you - as the programmer - need to explicitly state that you want a certain memory allocation to be allocated in a huge page. This approach nonetheless needs some kernel configuration, since the kernel needs to actually reserve these huge pages so that you can allocate them. If no more huge pages are available and your application requests an allocation in a huge page, the allocation will fail in this scenario. This approach appears to be deprecated. Using HugeTLB pages involves some filesystem you need to mount.2
Using explicit huge pages is considerably harder than, and provides little benefit over, using THPs, so I’m not going to talk any more about explicit huge pages. There is one interesting benefit of explicit huge pages which I don’t want to keep from you: you can huge-page-ify your application without any code changes by just linking it against libhugetlbfs. If you are interested in that, I recommend just installing libhugetlbfs and having a look at `man libhugetlbfs`.
Everything in the remainder of this article relates to THPs.
As already mentioned, THPs are managed by the kernel, to be precise by the `khugepaged` kernel thread. The `khugepaged` kernel thread runs in the background and continually tries to select consecutive “ordinary” (i.e., 4 kB) pages that can be combined into a single huge page. The size of the huge pages used for THPs is architecture-dependent and cannot be changed.3 For the kernel to actually run `khugepaged`, the `TRANSPARENT_HUGEPAGE` kernel option must be activated.4

There are several kernel runtime parameters that control `khugepaged` (see this great guide for details), but the most basic one you will need is set at:

```
/sys/kernel/mm/transparent_hugepage/enabled
```
This setting has three possible values: `always`, `never` and `madvise`. You can see what your system is running by just running

```
> cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
```

On my system (and this is probably the default on many systems), the `madvise` setting is active. The settings roughly mean:

- `never`: `khugepaged` is disabled (and will probably not even run). No THPs will be used.
- `always`: `khugepaged` is running and will try to find consecutive pages to collapse into a THP. It considers essentially all pages allocated by all user applications. User applications do not need to do anything to have their memory allocations transferred into huge pages.
- `madvise`: `khugepaged` will only consider pages mapped with the `MADV_HUGEPAGE` flag. Applications must explicitly set this flag on memory they allocate and can thus control which allocations should be eligible for huge pages and which should not.

You can change the setting by e.g. running (as `root`):

```
> echo "always" >! /sys/kernel/mm/transparent_hugepage/enabled
```
Note that `khugepaged` can usually only consolidate ordinary 4 kB pages into a huge page if the start of the region to be mapped into a huge page aligns to a huge page boundary. So, if you want to use 2 MB huge pages, your memory allocation should be aligned to a multiple of 2 MB.
So, allocating huge pages via THPs is easy. You only need to make sure that THPs are enabled, allocate your memory preferably aligned to a huge page boundary, and call `madvise(…)` on your allocated memory if necessary.
Consider this very simple example, where `is_huge()` and `is_thp()` inspect the page table to figure out whether the passed address is allocated in a huge page (resp. a transparent huge page as opposed to a HugeTLB huge page). An explanation of how this works and the implementation of these functions can be found in my previous article on inspecting the Linux page table.
```cpp
#include <cstdlib>  // for aligned_alloc()
#include <iostream>
#include <sys/mman.h>

// ... definition of is_huge() and is_thp() ...

constexpr size_t HPAGE_SIZE = 2 * 1024 * 1024;

int main() {
  auto size = 4 * HPAGE_SIZE;
  void *mem = aligned_alloc(HPAGE_SIZE, size);

  madvise(mem, size, MADV_HUGEPAGE);

  // Make sure the page is present
  static_cast<char *>(mem)[0] = 'x';

  std::cout << "Is huge? " << is_huge(mem) << "\n";
  std::cout << "Is THP? " << is_thp(mem) << "\n";
}
```
You can also download this as a fully self-contained example.
This code assumes that we have 2 MB huge pages (which is the size of THP pages for the x86 and x86-64 architectures)5, and then allocates four huge pages (i.e., 8 MB) of memory aligned to the huge page boundary. Right after the allocation, `madvise(…)` is called on the freshly allocated memory to tell `khugepaged` that we want this to be allocated in huge pages. Right after that, we actually access the memory (writing the arbitrary byte ‘x’) so that the kernel actually has to map the page to a frame.
If you downloaded thp.cpp from above, you should be able to reproduce this example like this:

```
> g++ --std=c++17 thp.cpp -o thp
> sudo ./thp
Is huge? 1
Is THP? 1
```

This is what you should see if your `khugepaged` setting is either `always` or `madvise`. Try setting that setting to `never`, or setting it to `madvise` and removing the `madvise(…)` call from the code. This should change the output to `Is huge? 0`.
Allocating all your memory aligned to a 2 MB boundary may be cumbersome, and if you intend to compile your code for different architectures, this may even mean that you need to align depending on the current architecture. Also, if you want to use background defragmentation to have your application use THPs without any code changes (i.e., set `/sys/kernel/mm/transparent_hugepage/enabled` to `always`), your allocations will probably not be aligned.
Does this mean you cannot use THPs? No, you are probably fine. It is true that `khugepaged` can only consolidate ordinary pages into a huge page starting on a huge-page-aligned boundary, i.e., a memory address that is a multiple of the huge page size. But if we assume that your allocations are “large”6, this is not a very strict requirement. If your (unaligned) allocation spans multiple huge pages, the “middle” of your allocation will actually fit neatly into huge-page-aligned memory, and only the beginning and end of your allocation will end up in ordinary pages.
Let’s verify this in code. I have modified the example from the previous section to explicitly allocate memory that is not aligned to a huge page boundary:
```cpp
// Return <size> bytes of allocated memory guaranteed *not* to be aligned to a THP boundary.
void *allocate_unaligned(size_t size);

int main() {
  auto size = 4 * HPAGE_SIZE;

  void *mem = allocate_unaligned(size);

  madvise(mem, size, MADV_HUGEPAGE);

  // Make sure all pages are present
  memset(mem, 'x', size);

  std::cout << "Start of memory: Is huge? " << is_huge(mem) << "\n";
  std::cout << "Start of memory: Is THP? " << is_thp(mem) << "\n";

  // This gives us a pointer to the next (starting from <mem>) address that is
  // aligned to HPAGE_SIZE.
  void *nextHPageAlignedPtr =
      reinterpret_cast<void *>(reinterpret_cast<size_t>(mem) + HPAGE_SIZE -
                               reinterpret_cast<size_t>(mem) % HPAGE_SIZE);

  std::cout << "On the next THP aligned address: Is huge? " << is_huge(nextHPageAlignedPtr) << "\n";
  std::cout << "On the next THP aligned address: Is THP? " << is_thp(nextHPageAlignedPtr) << "\n";
}
```

`allocate_unaligned()` implementation:

```cpp
// PAGE_SIZE is the ordinary page size (4096 on x86-64), defined in the full example.
void *allocate_unaligned(size_t size) {
  void *mem = mmap(0, size + PAGE_SIZE, (PROT_READ | PROT_WRITE),
                   (MAP_PRIVATE | MAP_ANONYMOUS), -1, 0);
  assert(mem != MAP_FAILED);
  if ((reinterpret_cast<size_t>(mem) % HPAGE_SIZE) == 0) {
    // We randomly got huge-page-aligned memory. Unmap and map again with a
    // shift of one (non-huge!) page.
    std::cout << "First attempt yielded aligned memory. Remapping.\n";
    void *target_addr =
        reinterpret_cast<void *>(reinterpret_cast<size_t>(mem) + PAGE_SIZE);
    munmap(mem, size + PAGE_SIZE);
    mem = mmap(target_addr, size, (PROT_READ | PROT_WRITE),
               (MAP_PRIVATE | MAP_ANONYMOUS), -1, 0);
    assert(mem != MAP_FAILED);
  }

  return mem;
}
```

Again, you can download this as a fully self-contained example that you should be able to build using `g++ --std=c++17 thp_unaligned.cpp -o thp_unaligned`. What does the code do? The magic `allocate_unaligned()` function call (expand the drawer above for the implementation) allocates 8 MB of memory not aligned to a THP boundary. We then advise the kernel that we want THP pages via the `madvise(…)` call. In the computation of `nextHPageAlignedPtr`, we build a special pointer into our newly allocated memory: we basically “round up” the memory address `mem` to the next multiple of `HPAGE_SIZE`, giving us a pointer into the middle of the allocation, i.e., the block of memory that we expect to be allocated in huge pages.
And indeed, when executing this, I get on my machine:

```
Start of memory: Is huge? 0
Start of memory: Is THP? 0
On the next THP aligned address: Is huge? 1
On the next THP aligned address: Is THP? 1
```

So we see that `khugepaged` is rather forgiving with unaligned memory: if your allocations are large6, most of your memory allocation will end up in a huge page.
This concludes my little tour around transparent huge pages. THPs are surprisingly simple to use - a simple `madvise` call is all that is necessary in the default case. Adventurous system engineers might even just set `…/transparent_hugepage/enabled` to `always` and enjoy huge pages even in applications they have no control over. I’m currently thinking about evaluating that for machines hosting some memory-intensive tasks. If you have any experience using the `always` setting, I’d love to hear how that went for you.
The benefits of which are unclear. I tried asking around what the advantage of the filesystem approach is, but got no real answers. If you want to weigh in on this, feel free to leave a comment or reply to my SE question. ↩︎
This is hardcoded as a macro in the Linux kernel, where the exact size (i.e., `PMD_SHIFT`) is architecture-dependent. For x86, it is defined to be 21, which gives us 2^21 bytes = 2 * 1024 * 1024 bytes = 2 MB. ↩︎
This should be the case in basically all Linux kernels shipped with major distributions. ↩︎
Have a look at `/sys/kernel/mm/transparent_hugepage/hpage_pmd_size` to verify the THP size on your system. ↩︎
Note that even if you allocate seemingly small pieces of memory, e.g. via `malloc`, these will usually be part of a larger memory allocation. Functions like `malloc`, `new` and friends do not directly allocate memory; they are not syscalls. They are calls into your C runtime library. The memory allocation implementation of that C library will usually try to make large allocations on an OS level (because these are expensive), and serve smaller user allocations (i.e., calls to `malloc` and friends) from these larger allocated areas. ↩︎ ↩︎
You can use your Mastodon account to reply to this post.