I am currently building a piece of software that will need a lot of bog-standard SQL queries against an in-memory database. I was (and am) going with SQLite, but a good friend of mine who Knows About Databases™ recently told me that DuckDB is currently the hot stuff among in-memory databases - but mainly for complex (read: OLAP) queries.
In this article I explore whether DuckDB or SQLite is the better (read: faster) alternative for my simple queries.
Disclaimer: I’m not saying that the results in this article mean that “X is better than Y” or “X is bad”. I have a special use case - I only use relatively simple queries, and my data set is comparatively small. I am fully aware that my use case is not what DuckDB is made for. I think both DuckDB and SQLite are really great products.
Reproducing the Results: I publish my code and some additional data so that you can reproduce my results; see below.
Since I had that data lying around anyways, I am using the GTFS data for Germany as test data. The data is compiled by DELFI e.V. and can be downloaded for free (after registration) at www.opendata-oepnv.de¹. It is not provided under an open license, so I cannot redistribute it. However, it is the largest GTFS data set that I could find.
I only use the `trips` and `stop_times` data from the GTFS data set. The `trips` table contains all public transit trips in Germany, and the `stop_times` table contains all the calls that vehicles make during those trips. The data set contains about 1.7 million `trips` and about 35 million `stop_times`.
I use a small Python script (which is also available; see below) that drives SQLite v. 3.41.2 and DuckDB v. 0.9.1.
For the SQLite database, I prepare the database outside of the benchmark script using an SQL script. The SQL script creates the `trips` and `stop_times` tables with appropriate primary and foreign keys set (thus creating the respective indices).
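The actual SQL script ships with the archive linked below and also imports the CSV data. A minimal sketch of just the schema part, trimmed to the fields used in this article (the real GTFS tables have more columns, and the real script may differ), could look like this:

```python
import sqlite3

con = sqlite3.connect("data/gtfs.sqlite3")
con.executescript("""
CREATE TABLE trips (
    route_id     TEXT NOT NULL,
    trip_id      TEXT PRIMARY KEY,  -- implicitly creates the index used by 'indexed'
    direction_id INTEGER
    -- ... more GTFS columns ...
);
CREATE TABLE stop_times (
    trip_id       TEXT NOT NULL REFERENCES trips(trip_id),
    stop_sequence INTEGER NOT NULL,
    -- ... more GTFS columns ...
    PRIMARY KEY (trip_id, stop_sequence)  -- composite key, hence a composite index
);
""")
con.close()
```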
The DuckDB database is initialized directly in the benchmark script using DuckDB’s `read_csv_auto` method, and respective indices (see below) are created afterwards.
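In code, that initialization looks roughly like this (the file paths are assumptions; `read_csv_auto` infers column names and types from the CSV files):

```python
import duckdb

con = duckdb.connect()  # no file name given, so this is an in-memory database
# Load the GTFS CSV files into proper tables.
con.sql("CREATE TABLE trips AS SELECT * FROM read_csv_auto('data/trips.txt')")
con.sql("CREATE TABLE stop_times AS SELECT * FROM read_csv_auto('data/stop_times.txt')")
```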
I want the benchmarks to run deterministically, so I create lists of values to query for a priori and store them in two files. When I run n queries for some value in the benchmarks, I use the first n values from the appropriate file.
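In the published archive, `prepare.sh` does this job. A Python sketch of the idea (the file names and the fixed seed are assumptions, not the actual script):

```python
import random
import sqlite3

# Collect all trip IDs once, shuffle them with a fixed seed, and store them,
# so that every benchmark run queries for exactly the same values.
con = sqlite3.connect("data/gtfs.sqlite3")
trip_ids = [row[0] for row in con.execute("SELECT trip_id FROM trips")]
con.close()

random.seed(42)
random.shuffle(trip_ids)
with open("data/trip_ids.txt", "w") as f:
    f.write("\n".join(trip_ids))
```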
For each of the benchmarks mentioned below, I create a list of statements to be run (and timed). I run each such list of statements twice: The first run warms up the caches, and the second run is timed. The benchmarks are run on my 2016 Thinkpad T460s laptop, which has an i5-6300U CPU and 20 GB of RAM.
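A minimal sketch of that measurement loop (not the literal code from `bench.py`, but the same idea; it works unchanged for both `sqlite3` and `duckdb` connections):

```python
import time

def run_benchmark(con, statements):
    """Run the statement list twice: once to warm the caches, once timed."""
    for stmt in statements:      # warm-up run, not timed
        con.execute(stmt).fetchall()
    start = time.perf_counter()
    for stmt in statements:      # timed run
        con.execute(stmt).fetchall()
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / len(statements)  # milliseconds per query
```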
I run a total of six benchmarks, which I split into two groups: queries which can be served by looking at the built indices, and queries which cannot (i.e., where the DB systems need to iterate through the tables).
The simplest benchmark is the `indexed` benchmark, where each query simply retrieves a trip by its primary key, i.e., a query of the form `SELECT * FROM trips WHERE trip_id = <ID>`. The second benchmark, called `composite-indexed`, does the same for the `stop_times` table, which has a composite primary key (the primary key consists of the `trip_id` of the respective trip and a `stop_sequence` integer, which just orders the calls within a trip). The third query in this group is the `join` benchmark, which retrieves `stop_times` joined with the data from their respective `trip`, i.e., queries of the form
```sql
SELECT *
FROM stop_times JOIN trips ON trips.trip_id = stop_times.trip_id
WHERE stop_times.trip_id = <Trip ID> AND stop_sequence = <Sequence Number>;
```
The last benchmark in this group, called `indexed-multiple`, does the same as `indexed`, but queries for 50 rows per query instead of one.
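One natural way to fetch 50 rows in a single statement is an `IN` list; a sketch (the exact statement shape in `bench.py` may differ):

```python
# Given an open connection `con` and the pre-drawn list `trip_ids`:
# query 50 trips at once via an IN list with one placeholder per ID.
ids = trip_ids[:50]
placeholders = ", ".join("?" * len(ids))
rows = con.execute(
    f"SELECT * FROM trips WHERE trip_id IN ({placeholders})", ids
).fetchall()
```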
For each of these benchmarks, I run 500 queries and average the query times.
The first benchmark in the second group, `nonindexed`, retrieves `trips` by their route ID. In GTFS, every trip is associated with a route. There are about 27,000 routes in the GTFS data set, so we expect on average about 63 trips per route. The `route_id` field is not indexed. The next benchmark, called `many`, does the same but with the `direction_id` field. There are only two direction IDs, `0` and `1`, so we expect each query to return half of the trips, about 850,000.
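Both benchmarks therefore boil down to scans of the `trips` table; the query shapes are simply the following (parameter binding assumed, not the literal code from `bench.py`):

```python
# Neither route_id nor direction_id has an index, so these force table scans.
con.execute("SELECT * FROM trips WHERE route_id = ?", (route_id,)).fetchall()          # nonindexed
con.execute("SELECT * FROM trips WHERE direction_id = ?", (direction_id,)).fetchall()  # many
```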
For these benchmarks, I only run 100 queries each (because they take a lot longer…).
Let’s look at the results from the “with-index” benchmarks first, which are best looked at in a table.² The reported values are milliseconds per query:
| Benchmark | DuckDB (ms) | SQLite (ms) | Factor SQLite-to-DuckDB |
|---|---|---|---|
| `indexed` | 0.927 | 0.063 | 14.7 |
| `composite-indexed` | 9.68 | 0.078 | 124.1 |
| `join` | 13.3 | 0.118 | 112.7 |
| `indexed-multiple` | 1.00 | 0.068 | 14.7 |
Or, if you like plots more than tables:
We see that in this group, SQLite consistently outperforms DuckDB by one to two orders of magnitude. Keep in mind that the `composite-indexed` and `join` benchmarks both query the larger `stop_times` table, and not the smaller `trips` table. It looks to me as if DuckDB somehow does not profit from the indexing.
For the “without-index” benchmarks, we get a different picture (note the inverse factor):
| Benchmark | DuckDB (ms) | SQLite (ms) | Factor DuckDB-to-SQLite |
|---|---|---|---|
| `nonindexed` | 7.70 | 182 | 23.6 |
| `many` | 2.90 | 2722 | 938.6 |
Again as the obligatory plot:
Here, DuckDB clearly outperforms SQLite, by up to three orders of magnitude. I also think it’s very interesting that DuckDB actually gets faster in the `many` case, where the output size is larger than in the `nonindexed` case.
The interpretation seems to be clear: DuckDB does not seem to work with indices as well as SQLite does (see also the section below) but is blazingly fast when scanning large parts of the table. SQLite is great at using indices but takes its time when it has to look at the whole table.
Since my workload mainly consists of simple queries that should be answerable straight from an index (and it is feasible to just create as many indices as it takes…), I will stay with SQLite for the project.
The results make me very sceptical about whether I did the correct things to create indices in DuckDB. I only found a short page about indices in the DuckDB documentation, but if I read it correctly, it states that I should just use the usual `CREATE INDEX` SQL command. The benchmark script therefore does this:
```python
duckdb.sql("CREATE UNIQUE INDEX trips_id_idx ON trips(trip_id)")
duckdb.sql("CREATE UNIQUE INDEX stop_times_id_idx ON stop_times(trip_id,stop_sequence)")
```
To test my hypothesis, I removed these two lines (i.e., did not create any indices in DuckDB) and re-ran the benchmarks. The performance of the `indexed` benchmark went down from 0.9 ms / query to 2 ms / query, so those `CREATE INDEX` lines are not without effect. However, the `composite-indexed`, `join` and `indexed-multiple` benchmarks remain largely unaffected.
The documentation does say:
> Unidimensional indexes are supported, while multidimensional indexes are not yet supported.
So `composite-indexed` being unaffected is not a big surprise. However, the index only giving a speedup of about a factor of 2 seems weird, and I would expect `join` and `indexed-multiple` to also show at least a small speedup.
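One way to dig deeper would be to look at the query plans. DuckDB supports the standard `EXPLAIN` prefix, so a quick check could look like this (a sketch with a made-up trip ID; an index scan should show up in the plan if the ART index is actually used):

```python
# Print DuckDB's query plan for one of the composite-indexed lookups.
print(duckdb.sql(
    "EXPLAIN SELECT * FROM stop_times "
    "WHERE trip_id = 'some-trip-id' AND stop_sequence = 1"
))
```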
I publish my benchmark code, which you can download as a tar archive here. The archive contains the `bench.py` benchmark script³, the `gtfs_to_sqlite.sql` script and the `prepare.sh` script.

Reproducing my results should be as easy as this:

1. Put the GTFS data¹ into the `data/` subfolder. It should at least contain the `trips.txt` and `stop_times.txt` files.
2. Run `cd data; sqlite3 gtfs.sqlite3 < gtfs_to_sqlite.sql` to create the SQLite database.
3. Run `bash ./prepare.sh` to create the two random lists of IDs.
4. Run `cd ..; python3 ./bench.py`.

It should output a table with the measured values and place the plots at `/tmp/dbench_plot_indexed.pdf` and `/tmp/dbench_plot_nonindexed.pdf`, respectively.
¹ Sorry, parts of that page may be in German, even though the link goes to the English website… ↩︎ ↩︎

² Honestly: The plots are only there so that this post has a nice image that can appear when this article is shared… The differences between DuckDB and SQLite are just always so large that plotting them together completely hides all other detail. ↩︎

³ Note that the script is not necessarily a shining example of clean, well-documented Python code. However, it is short, does nothing fancy and should be manageable. ↩︎