Fixing a bug in NewsFlash and shaving a yak.
The Initial Task
There is a bug in NewsFlash, my RSS reader of choice. When using the light theme, <blockquote> elements from this feed have the same white font color as the background and are therefore unreadable. Fixing this seemed like a simple and quick task: NewsFlash consists of just 18k lines of Rust code (although I later found out that it is split into two Rust crates, called news_flash_gtk and news_flash, which adds another 13k lines of Rust code), so it cannot be that hard to read. find -name '*.css' returns three results, one of which is news_flash_gtk/data/resources/article_view/style.css. Naturally, I suspected that some text is white because this CSS file sets it to white, or at least forgets to set it to black. So, it should be a quick fix. But it turned out to be quite a yak-shaving…
Building NewsFlash
NewsFlash is built with Meson. I like Meson quite a lot, because its build definitions are really boring and readable. As recommended, I built it with
meson --prefix=/usr build
cd build
ninja
which worked just fine. Unfortunately, changing style.css and re-running ninja did not trigger a rebuild. This could have two possible reasons:
- style.css is just dead code.
- style.css is read in some build step, but not specified as a dependency in the build definition.
I tried for a while, unsuccessfully, to find the build step in which style.css is read. The winning idea was to prevent the file from being read using chmod 000 style.css, then do a clean rebuild and see who complains:
error: proc-macro derive panicked
--> src/article_view/mod.rs:39:10
|
39 | #[derive(RustEmbed)]
| ^^^^^^^^^
|
= help: message: File should be readable: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
Bingo. In hindsight, I should have come up with that idea sooner. Understanding why touching style.css did not trigger a rebuild is simple: news_flash_gtk/src/meson.build includes the lines
newsflash_sources = files(
'article_list/models/article.rs',
...
)
cargo_release = custom_target('cargo-build',
build_by_default: true,
input: [
newsflash_sources,
],
output: ['com.gitlab.newsflash'],
install: true,
install_dir: newsflash_bindir,
console: true,
command: [cargo_script,
'@SOURCE_ROOT@',
'@OUTPUT@',
meson.build_root(),
profile,
'--features "@0@"'.format(features)
])
Running ninja will only rerun cargo if one of the files listed in newsflash_sources changed. Adding style.css to newsflash_sources therefore fixes this bug. I sent this patch upstream and it got merged.
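The underlying idea can be sketched in a few lines of Python. This is a deliberately simplified model of a timestamp-based rebuild check, not ninja's actual implementation (the function name needs_rebuild and the file names in the note below are assumptions for illustration): an output is rebuilt only if it is missing or older than one of its declared inputs, so a file that is read during the build but never declared cannot trigger a rebuild.

```python
import os

def needs_rebuild(output, declared_inputs):
    """Return True if output is missing or older than any declared input.

    Files that the build step reads but that are not listed in
    declared_inputs are invisible to this check.
    """
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in declared_inputs)
```

In this model, needs_rebuild("binary", ["article.rs"]) stays False after style.css is edited, and only becomes True once style.css is part of the declared inputs.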
Database Problems
Now that the build works, let’s start NewsFlash. Once I attempt to select Local RSS (or attempt to log in), nothing happens, but this is printed to stderr:
02:20:41 - ERROR - Database migration failed: Failed with: FOREIGN KEY constraint failed (news_flash::database:181)
Oh Fuck Me. Deleting the config folder that stores the database does not change anything.
Failing Tests
After debugging for a while, I thought: I should run the unit tests. Maybe there is some minimal database example that triggers the same bug.
volker ~ Sync git news_flash_gtk cargo test
Finished test [unoptimized + debuginfo] target(s) in 0.62s
Running unittests (target/debug/deps/news_flash_gtk-0f333f82bdbfa5b7)
running 8 tests
test color::tests::hsla_to_rgba ... ok
test color::tests::parse_color_string ... ok
test color::tests::rgba_to_hsla ... ok
test i18n::tests::test_i18n ... ok
test i18n::tests::test_i18n_f ... ok
test i18n::tests::test_pi18n ... ok
test i18n::tests::test_i18n_k ... ok
test util::mercury::tests::parse_phoronix ... ok
test result: ok. 8 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 10.55s
error: test failed, to rerun pass '--bin news_flash_gtk'
Caused by:
process didn't exit successfully: `/home/volker/Sync/git/news_flash_gtk/target/debug/deps/news_flash_gtk-0f333f82bdbfa5b7` (signal: 6, SIGABRT: process abort signal)
volker ~ Sync git news_flash_gtk
Eight ok marks … but it failed anyway? It took me a while to understand what is happening: It turns out that the parse_phoronix function contains code that will make the program crash. But this crash is delayed: control flow reaches the end of the parse_phoronix function and returns to the test harness. It is the exit handler that aborts:
Thread 1 "news_flash_gtk-" received signal SIGABRT, Aborted.
0x00007ffff227a36c in ?? () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007ffff227a36c in () at /usr/lib/libc.so.6
#1 0x00007ffff222a838 in raise () at /usr/lib/libc.so.6
#2 0x00007ffff2214535 in abort () at /usr/lib/libc.so.6
#3 0x00007ffff54774ac in () at /usr/lib/libwebkit2gtk-5.0.so.0
#4 0x00007ffff586ecdd in () at /usr/lib/libwebkit2gtk-5.0.so.0
#5 0x00007ffff57d5c5b in () at /usr/lib/libwebkit2gtk-5.0.so.0
#6 0x00007ffff2cb1b35 in g_object_unref () at /usr/lib/libgobject-2.0.so.0
#7 0x00007ffff57ddfbc in () at /usr/lib/libwebkit2gtk-5.0.so.0
#8 0x00007ffff2cb1b35 in g_object_unref () at /usr/lib/libgobject-2.0.so.0
#9 0x00007ffff222cef5 in () at /usr/lib/libc.so.6
#10 0x00007ffff222d070 in on_exit () at /usr/lib/libc.so.6
#11 0x00007ffff2215297 in () at /usr/lib/libc.so.6
#12 0x00007ffff221534a in __libc_start_main () at /usr/lib/libc.so.6
#13 0x00005555555dfec5 in _start () at ../sysdeps/x86_64/start.S:115
(gdb)
Rust Linking Errors
I suspected that the parse_phoronix function corrupts memory, which leads to this crash. So, I decided to run everything under ASAN. The documentation says I should compile with RUSTFLAGS="-Zsanitizer=address". So let’s try that:
error: /home/volker/Sync/git/news_flash_gtk/target/debug/deps/libproc_macro_error_attr-6323cc5d4e1386c9.so: undefined symbol: __asan_option_detect_stack_use_after_return
--> /home/volker/.cargo/registry/src/github.com-1ecc6299db9ec823/proc-macro-error-1.0.4/src/lib.rs:284:9
|
284 | pub use proc_macro_error_attr::proc_macro_error;
| ^^^^^^^^^^^^^^^^^^^^^
A fucking linking error in Rust. What a rare sight. It turns out that you need an additional flag, so I created a PR that adds a note to the documentation. I am a bit disappointed that Rust does not catch this error and give you a nicer error message. Anyway, ASAN confirmed the absence of memory corruption. So, our search for the bug that causes the SIGABRT continues.
WebKit Debug Symbols
I want to have debug symbols in my backtrace. So let’s try to compile WebKit with debug symbols. I followed their documentation, then compiled NewsFlash using
export LIBRARY_PATH="/run/media/volker/DATA/cloned/webkitgtk-2.36.1/lib"
export LD_LIBRARY_PATH="/run/media/volker/DATA/cloned/webkitgtk-2.36.1/lib"
cargo build
Now the test fails with
** (process:222197): ERROR **: 22:15:48.714: Unable to spawn a new child process: Failed to spawn child process “/usr/local/libexec/webkit2gtk-4.1/WebKitNetworkProcess” (No such file or directory)
Oh Fuck Me Hard. Looks like there are some subtleties in how WebKit is built. So let’s download the PKGBUILD that describes how Arch Linux builds WebKit and modify it to include debug symbols. Once we run it in gdb, we are greeted by:
warning: Could not find DWO CU Source/WebKit/CMakeFiles/WebKit.dir/__/__/DerivedSources/WebKit/AutomationBackendDispatchers.cpp.dwo(0x8819e41405bd3bc0) referenced by CU at offset 0x23 [in module /usr/lib/debug/usr/lib/libwebkit2gtk-5.0.so.0.0.0.debug]
Dwarf Error: unexpected tag 'DW_TAG_skeleton_unit' at offset 0x23 [in module /usr/lib/debug/usr/lib/libwebkit2gtk-5.0.so.0.0.0.debug]
warning: Could not find DWO CU Source/JavaScriptCore/CMakeFiles/LowLevelInterpreterLib.dir/llint/LowLevelInterpreter.cpp.dwo(0x8640769469a7cd6a) referenced by CU at offset 0x23 [in module /usr/lib/debug/usr/lib/libjavascriptcoregtk-5.0.so.0.0.0.debug]
Dwarf Error: unexpected tag 'DW_TAG_skeleton_unit' at offset 0x23 [in module /usr/lib/debug/usr/lib/libjavascriptcoregtk-5.0.so.0.0.0.debug]
Angry German noises.
It took me longer than it should have to understand both warnings: Those *.dwo files get built, but not installed. Source/.../AutomationBackendDispatchers.cpp.dwo does exist in the build directory. So let’s cd into the build directory and try again:
Dwarf Error: unexpected tag 'DW_TAG_skeleton_unit' at offset 0x47fc [in module /usr/lib/debug/usr/lib/libjavascriptcoregtk-5.0.so.0.0.0.debug]
Exception ignored in: <gdb._GdbOutputFile object at 0x7fdccb07fa30>
Traceback (most recent call last):
File "/usr/share/gdb/python/gdb/__init__.py", line 47, in flush
def flush(self):
KeyboardInterrupt:
Lol, What? I did not touch the keyboard. Running sudo journalctl -b
shows the true reason:
Mai 26 00:01:17 battle earlyoom[513]: sending SIGTERM to process 40789 uid 1000 "gdb": badness 1124, VmRSS 4782 MiB
GDB uses up too much memory, so it gets killed. I have 8 GB of RAM and 24 GB of swap, but apparently that is not enough. I opened an issue at bugzilla to complain about how GDB claims it was killed by a KeyboardInterrupt, even though it was a SIGTERM.
Next try: Let’s replace -ggdb3
with -ggdb1
. This should generate less debuginfo and should therefore use less RAM.
… Nope, still eats too much RAM.
I am running Arch Linux on my machine here. The main reason why I use Arch Linux is that they have a really, really detailed wiki. In fact, I found something about my problem in the wiki:
Alternatively you can put the debug information in a separate package by enabling both debug and strip, debug symbols will then be stripped from the main package and placed, together with source files to aid in stepping through the debugger, in a separate pkgbase-debug package. This is advantageous if the package contains very large binaries (e.g. over a GB with debug symbols included) as it might cause freezing and other strange, unwanted behavior occurring.
Ok, so let’s try that … Nope, same problem persists: OOM if I cd into build, and warnings about non-existent .dwo files otherwise.
Downloading Debug Symbols
I don’t think trying to solve the OOM issue from the previous section would have a good cost/reward ratio. So, let’s give up on building it by hand and just use the best feature since sliced bread: For Arch Linux packages (and webkit2gtk is an Arch Linux package) gdb gives you this prompt:
This GDB supports auto-downloading debuginfo from the following URLs:
https://debuginfod.archlinux.org
Enable debuginfod for this session? (y or [n])
Upon confirming this prompt, gdb will download the correct debug symbols and we get our backtrace:
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007ffff208e3d3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2 0x00007ffff203e838 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007ffff2028535 in __GI_abort () at abort.c:79
#4 0x00007ffff543e4bc in WTFCrashWithInfo(int, char const*, char const*, int) () at /usr/src/debug/build/WTF/Headers/wtf/Assertions.h:741
#5 allDataStores () at /usr/src/debug/webkitgtk-2.36.3/Source/WebKit/UIProcess/WebsiteData/WebsiteDataStore.cpp:101
#6 WebKit::WebsiteDataStore::~WebsiteDataStore() () at /usr/src/debug/webkitgtk-2.36.3/Source/WebKit/UIProcess/WebsiteData/WebsiteDataStore.cpp:152
#7 0x00007ffff583626d in WebKit::WebsiteDataStore::~WebsiteDataStore() () at /usr/src/debug/webkitgtk-2.36.3/Source/WebKit/UIProcess/WebsiteData/WebsiteDataStore.cpp:158
#8 0x00007ffff579d04b in WTF::ThreadSafeRefCounted<API::Object, (WTF::DestructionThread)0>::deref() const::{lambda()#1}::operator()() const () at /usr/src/debug/build/WTF/Headers/wtf/ThreadSafeRefCounted.h:117
#9 WTF::ThreadSafeRefCounted<API::Object, (WTF::DestructionThread)0>::deref() const () at /usr/src/debug/build/WTF/Headers/wtf/ThreadSafeRefCounted.h:129
#10 WTF::DefaultRefDerefTraits<WebKit::WebsiteDataStore>::derefIfNotNull(WebKit::WebsiteDataStore*) () at /usr/src/debug/build/WTF/Headers/wtf/RefPtr.h:42
#11 WTF::RefPtr<WebKit::WebsiteDataStore, WTF::RawPtrTraits<WebKit::WebsiteDataStore>, WTF::DefaultRefDerefTraits<WebKit::WebsiteDataStore> >::~RefPtr() () at /usr/src/debug/build/WTF/Headers/wtf/RefPtr.h:74
#12 _WebKitWebsiteDataManagerPrivate::~_WebKitWebsiteDataManagerPrivate() () at /usr/src/debug/webkitgtk-2.36.3/Source/WebKit/UIProcess/API/glib/WebKitWebsiteDataManager.cpp:100
#13 webkit_website_data_manager_finalize() () at /usr/src/debug/webkitgtk-2.36.3/Source/WebKit/UIProcess/API/glib/WebKitWebsiteDataManager.cpp:118
#14 0x00007ffff4269b35 in g_object_unref (_object=<optimized out>) at ../glib/gobject/gobject.c:3678
#15 g_object_unref (_object=0x7fffe825bd20) at ../glib/gobject/gobject.c:3553
#16 0x00007ffff57a53ac in WTF::derefGPtr<_WebKitWebsiteDataManager>(_WebKitWebsiteDataManager*) () at /usr/src/debug/build/WTF/Headers/wtf/glib/GRefPtr.h:269
#17 WTF::GRefPtr<_WebKitWebsiteDataManager>::~GRefPtr() () at /usr/src/debug/build/WTF/Headers/wtf/glib/GRefPtr.h:82
#18 _WebKitWebContextPrivate::~_WebKitWebContextPrivate() () at /usr/src/debug/webkitgtk-2.36.3/Source/WebKit/UIProcess/API/glib/WebKitWebContext.cpp:205
#19 webkit_web_context_finalize() () at /usr/src/debug/webkitgtk-2.36.3/Source/WebKit/UIProcess/API/glib/WebKitWebContext.cpp:305
#20 0x00007ffff4269b35 in g_object_unref (_object=<optimized out>) at ../glib/gobject/gobject.c:3678
#21 g_object_unref (_object=0x7fffe8246140) at ../glib/gobject/gobject.c:3553
#22 0x00007ffff2040ef5 in __run_exit_handlers (status=0, listp=0x7ffff21fe778 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:113
#23 0x00007ffff2041070 in __GI_exit (status=<optimized out>) at exit.c:143
#24 0x00007ffff2029297 in __libc_start_call_main (main=main@entry=0x5555556c8630 <main>, argc=argc@entry=3, argv=argv@entry=0x7fffffffd818) at ../sysdeps/nptl/libc_start_call_main.h:74
#25 0x00007ffff202934a in __libc_start_main_impl (main=0x5555556c8630 <main>, argc=3, argv=0x7fffffffd818, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd808) at ../csu/libc-start.c:392
#26 0x00005555556c70e5 in _start () at ../sysdeps/x86_64/start.S:115
The good news: We have a backtrace. The bad news: It is a backtrace straight from hell. The good thing is that we can now understand how the crash happens after the test passed: The WebKit/GTK functions called by the unit test use atexit to register a function (g_object_unref, #21 in this bt, invoked by __run_exit_handlers in #22) that is called upon termination. This function aborts. Why? Because #5 in this bt triggers this assertion:
static HashMap<PAL::SessionID, WebsiteDataStore*>& allDataStores()
{
RELEASE_ASSERT(isUIThread());
static NeverDestroyed<HashMap<PAL::SessionID, WebsiteDataStore*>> map;
return map;
}
Apparently Webkit/GTK does not like the fact that the unit test ran on a different thread than the atexit handler. We can confirm this theory: A unit test containing
gtk4::init().unwrap();
MercuryParser::parse("");
will crash with the bt above, but if we put this into fn main
it runs fine. Or does it? If we build the webkit2gtk-5.0 package in debug mode, cargo run
outputs
LEAK: 1 WebProcessPool
and exits with exit code 0. With this debug build of webkit2gtk-5.0, cargo test still crashes after the unit tests are completed, but the crash is slightly different:
ASSERTION FAILED: !m_impl || !m_shouldEnableAssertions || m_impl->wasConstructedOnMainThread() == isMainThread()
/usr/src/debug/build/WTF/Headers/wtf/WeakPtr.h(148) : T* WTF::WeakPtr< <template-parameter-1-1>, <template-parameter-1-2> >::operator->() const [with T = IPC::MessageReceiver; Counter = WTF::EmptyCounter]
Thread 1 "news_flash_gtk-" received signal SIGABRT, Aborted.
LEAK: 4 WebCoreNode
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007fffeda8e3d3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2 0x00007fffeda3e838 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007fffeda28535 in __GI_abort () at abort.c:79
#4 0x00007ffff1ab152d in () at /usr/lib/libwebkit2gtk-5.0.so.0
#5 0x00007ffff1ab19b9 in IPC::MessageReceiverMap::invalidate() [clone .cold] () at /usr/lib/libwebkit2gtk-5.0.so.0
#6 0x00007ffff252489a in WebKit::WebProcessPool::~WebProcessPool() () at /usr/lib/libwebkit2gtk-5.0.so.0
#7 0x00007ffff25257dd in WebKit::WebProcessPool::~WebProcessPool() () at /usr/lib/libwebkit2gtk-5.0.so.0
#8 0x00007ffff25ff929 in webkitWebContextDispose(_GObject*) () at /usr/lib/libwebkit2gtk-5.0.so.0
#9 0x00007ffff08b1a64 in g_object_unref (_object=<optimized out>) at ../glib/gobject/gobject.c:3636
#10 g_object_unref (_object=0x7fffe40cd2b0) at ../glib/gobject/gobject.c:3553
#11 0x00007fffeda40ef5 in __run_exit_handlers (status=0, listp=0x7fffedbfe778 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:113
#12 0x00007fffeda41070 in __GI_exit (status=<optimized out>) at exit.c:143
#13 0x00007fffeda29297 in __libc_start_call_main (main=main@entry=0x5555556c8630 <main>, argc=argc@entry=3, argv=argv@entry=0x7fffffffd5e8) at ../sysdeps/nptl/libc_start_call_main.h:74
#14 0x00007fffeda2934a in __libc_start_main_impl (main=0x5555556c8630 <main>, argc=3, argv=0x7fffffffd5e8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd5d8) at ../csu/libc-start.c:392
#15 0x00005555556c70e5 in _start () at ../sysdeps/x86_64/start.S:115
(gdb)
I am sorry that there are missing debug symbols. Building the debug symbols myself fails as described in “WebKit Debug Symbols”, and the debug symbols from https://debuginfod.archlinux.org are for a differently built package. I could take a look at how the debug symbols from https://debuginfod.archlinux.org are built and use this to create working debug symbols. But I decided against it, as it would probably have a bad cost/reward ratio. Anyway, we can still see that the crash is in an atexit function that complains it runs on the wrong thread.
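To make the failure mode concrete, here is a toy model in Python. It is not WebKit's actual logic; the class ThreadBound and the helpers create_on_worker and try_release are invented for this illustration. An object remembers the thread that created it and refuses to be torn down from any other thread, which is roughly the situation when a test harness creates objects on a worker thread and the exit handlers later run on the main thread.

```python
import threading

class ThreadBound:
    """Toy object that must be released on the thread that created it."""

    def __init__(self):
        self.owner = threading.get_ident()

    def release(self):
        # Loosely analogous to RELEASE_ASSERT(isUIThread()): refuse to run
        # on any thread other than the owning one.
        if threading.get_ident() != self.owner:
            raise RuntimeError("released on the wrong thread")
        return "released"

def create_on_worker():
    """Create a ThreadBound on a helper thread, as a test harness might."""
    created = []
    worker = threading.Thread(target=lambda: created.append(ThreadBound()))
    worker.start()
    worker.join()
    return created[0]

def try_release(obj):
    """Release obj on the calling thread and report what happened."""
    try:
        return obj.release()
    except RuntimeError as err:
        return str(err)

same_thread = try_release(ThreadBound())        # created and released here
cross_thread = try_release(create_on_worker())  # created elsewhere
print(same_thread)
print(cross_thread)
```

The first call succeeds because creation and release happen on the same thread; the second one fails because the object was created on the worker and released on the main thread.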
D_GLIBCXX_DEBUG leads to Another Warning
Forget the problems of the previous sections for a second: I found another bug, more or less by chance. If I compile webkitgtk with -D_GLIBCXX_DEBUG
and put
gtk4::init().unwrap();
MercuryParser::parse("");
into fn main
, it still prints LEAK: 1 WebProcessPool
, but an additional warning appears:
/usr/include/c++/12.1.0/bits/stl_heap.h:209:
In function:
constexpr void std::push_heap(_RAIter, _RAIter, _Compare) [with _RAIter
= WebCore::TimerHeapIterator; _Compare =
WebCore::TimerHeapLessThanFunction]
Error: comparison doesn't meet irreflexive requirements, assert(!(a < a)).
Objects involved in the operation:
instance "functor" @ 0x7fff0154ba9f {
}
iterator::value_type "ordered type" {
}
The documentation about the -D_GLIBCXX_DEBUG flag says:
Note that this flag changes the sizes and behavior of standard class templates such as std::vector, and therefore you can only link code compiled with debug mode and code compiled without debug mode if no instantiation of a container is passed between the two translation units.
Compiling everything with this flag would be too much work, so I manually added the same check to stl_heap.h:
if (!(__first == __last || !__comp(*__first, *__first))) {
__builtin_trap();
}
and built webkit2gtk-5.0 with this modified library. Running everything works and it does not trap, so I guess our “comparison doesn't meet irreflexive requirements” warning was just an artifact of us compiling one part with -D_GLIBCXX_DEBUG and the other part without it. Right?
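For readers who have not met the term: a comparison used for heaps or sorting must be irreflexive, that is, comp(a, a) must be false for every a. The check I added to stl_heap.h tests exactly that for the first element of the range. Here is the same idea as a small Python sketch (the function name and the sample comparators are made up for illustration and have nothing to do with WebCore's TimerHeapLessThanFunction):

```python
import operator

def irreflexivity_violations(less, items):
    """Return the items x for which less(x, x) is true.

    A strict ordering must never report an element as smaller than itself,
    so a correct comparator yields an empty list here.
    """
    return [x for x in items if less(x, x)]

samples = [3, 1, 2]
strict = irreflexivity_violations(operator.lt, samples)  # "<" is irreflexive
sloppy = irreflexivity_violations(operator.le, samples)  # "<=" is not
print(strict, sloppy)
```

A comparator built on "<=" instead of "<" is the classic way to fail this requirement, which is why the second call reports every sample.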
Giving Up
So, where was I again? We have a unit test that makes the exit handler crash, and we have a backtrace. However, with a heavy heart I decided to give up on fixing it. Why? The code crashes because the Rust code calls WebKit in a way you should not call it. But how are you supposed to call it? Surely it has proper documentation, you might say. You are partially right: WebKit has good documentation. But we are not using WebKit itself. There are four different versions of WebKit:
1. The main version of WebKit from Apple
2. The GTK3 fork of WebKit, which has a Rust wrapper
3. The GTK4 version of it
4. A custom fork of the GTK4 version
Each step down in that list is a step down in the quality of documentation. (To be fair, 3. and 4. are nearly identical.) Other people seem to agree with me. The basic usage example of the GTK3 fork works. The basic usage example of the GTK4 fork is broken: it is just a copy of the GTK3 example and will not compile with GTK4. Also, take a look at these two programs:
#include "webkit2/webkit2.h"
int main() { auto view = webkit_web_view_new(); }
and
#include "webkit2/webkit2.h"
#include <thread>
int main() {
std::thread helper(webkit_web_view_new);
helper.join();
}
Both of them run fine with a release version of webkit2gtk-5.0, but with a debug version of webkit2gtk-5.0, the first one warns
LEAK: 1 WebProcessPool
LEAK: 1 WebPageProxy
and the second one aborts with
ASSERTION FAILED: Completion handler should always be called
!m_function
/usr/src/debug/build/WTF/Headers/wtf/CompletionHandler.h(59) : WTF::CompletionHandler<Out(In ...)>::~CompletionHandler() [with Out = void; In = {WTF::HashMap<WTF::String, std::unique_ptr<WebKit::DeviceIdHashSaltStorage::HashSaltForOrigin, std::default_delete<WebKit::DeviceIdHashSaltStorage::HashSaltForOrigin> >, WTF::DefaultHash<WTF::String>, WTF::HashTraits<WTF::String>, WTF::HashTraits<std::unique_ptr<WebKit::DeviceIdHashSaltStorage::HashSaltForOrigin, std::default_delete<WebKit::DeviceIdHashSaltStorage::HashSaltForOrigin> > >, WTF::HashTableTraits>&&}]
So, it seems we need to properly destruct our WebView object. Surely, the documentation can tell us how we should properly destruct it. Nope. There is no tutorial, no guide, no usage example - nothing. I am not looking forward to touching 3 million lines of C++ code with bad API documentation, so I gave up. At the time, I was at GPN20 and everyone ran away once they heard the word “webkit”.
Update: There actually is an example program, called MiniBrowser
in the source tree.
More Failing Unit Tests
I said before that cargo test reports the success of 8 of 8 unit tests, then aborts. That is true for news_flash_gtk, but news_flash_gtk has a dependency called news_flash. Running the unit tests of news_flash shows 15 failures, apparently with a common cause.
Using git bisect, I found that commit 3c840c2b introduced this bug by adding another database migration. I should note here that if news_flash wants to set up a new database, it sets up an empty one, then runs every database migration they ever did to arrive at the current version. A dump of the database constructed this way showed the problem:
CREATE TABLE taggings (
article_id TEXT NOT NULL REFERENCES "_articles_old"(article_id),
tag_id TEXT NOT NULL REFERENCES tags(tag_id),
PRIMARY KEY (article_id, tag_id)
);
I should note that no _articles_old
table exists. If I concatenate all database migrations and run them, instead of running them using diesel, the constructed database contains
CREATE TABLE taggings (
article_id TEXT NOT NULL REFERENCES articles(article_id),
tag_id TEXT NOT NULL REFERENCES tags(tag_id),
PRIMARY KEY (article_id, tag_id)
);
(the articles table exists) and everything works fine. I used diesel_logger to log all SQL queries executed by diesel. But strangely, executing those queries did not reproduce the problem. It turned out that there is a bug in diesel_logger: It does not log batch_execute queries. This bug is fixed in master, and I made a PR to improve the documentation. Instead of using this logger, I ended up putting raw println! statements in the diesel source code to get the output format right. Executing the logged queries did reproduce the problem. The difference between the logged queries and the concatenation of all migrations is that diesel wraps every migration in a BEGIN/COMMIT block. Reducing the log and the concatenation leads to this minimal example:
PRAGMA foreign_keys = ON;
CREATE TABLE article (
id TEXT
);
CREATE TABLE other (
id TEXT REFERENCES article(id)
);
BEGIN;
PRAGMA legacy_alter_table=ON;
PRAGMA foreign_keys=OFF;
ALTER TABLE article RENAME TO old;
COMMIT;
This creates the following database:
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE IF NOT EXISTS "old" (
id TEXT
);
CREATE TABLE other (
id TEXT REFERENCES "old"(id)
);
COMMIT;
If BEGIN
and COMMIT
are removed, it creates this database.
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE IF NOT EXISTS "old" (
id TEXT
);
CREATE TABLE other (
id TEXT REFERENCES article(id)
);
COMMIT;
Notice the difference in line 7. The reason for the difference can be found in the SQLite docs of PRAGMA foreign_keys:
This pragma is a no-op within a transaction; foreign key constraint enforcement may only be enabled or disabled when there is no pending BEGIN or SAVEPOINT.
One way to fix this is to do a PRAGMA foreign_keys=OFF
before the migrations and a PRAGMA foreign_keys=ON
afterwards. Unfortunately, while I was working on fixing this bug, another fix landed.
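The quoted behaviour is easy to check in isolation. The following is a minimal sketch using Python's built-in sqlite3 module and an in-memory database (neither is part of NewsFlash or diesel, and the helper name fk_state is made up for this illustration). It shows that PRAGMA foreign_keys=OFF is silently ignored between BEGIN and COMMIT, and that it takes effect once it is issued outside of a transaction, which is the idea behind the fix described above.

```python
import sqlite3

def fk_state(conn):
    """Return the current value of PRAGMA foreign_keys (0 or 1)."""
    return conn.execute("PRAGMA foreign_keys").fetchone()[0]

# isolation_level=None puts the sqlite3 module into autocommit mode, so the
# only transactions are the ones opened explicitly with BEGIN.
conn = sqlite3.connect(":memory:", isolation_level=None)

conn.execute("PRAGMA foreign_keys=ON")
enabled = fk_state(conn)

conn.execute("BEGIN")
conn.execute("PRAGMA foreign_keys=OFF")  # ignored: a transaction is pending
inside_transaction = fk_state(conn)
conn.execute("COMMIT")

conn.execute("PRAGMA foreign_keys=OFF")  # honoured: no transaction is pending
outside_transaction = fk_state(conn)

print(enabled, inside_transaction, outside_transaction)
```

Going by the SQLite documentation quoted above, this should print 1 1 0: the setting survives the attempt to switch it off inside the transaction and only changes once the pragma runs outside of it.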
Frustrated German Noises
Note to myself: If you are debugging something, debug it on the latest release and check every day if the bug still exists on master. Or even better: Talk to the other programmers.
Failing Package Build
Now, back to our database problem. This bug only occurred when I built newsflash myself, but the version I got with pacman -S newsflash worked fine. So, let’s try to build the newsflash package myself:
volker ~ Documents cloned newsflash wget https://raw.githubusercontent.com/archlinux/svntogit-community/packages/newsflash/trunk/PKGBUILD
volker ~ Documents cloned newsflash makepkg
...
Running custom install script '/usr/bin/python /home/volker/Documents/cloned/newsflash/src/news_flash_gtk-1.5.1/build-aux/meson_post_install.py'
rm: cannot remove '/home/volker/Documents/cloned/newsflash/pkg/newsflash/build': No such file or directory
==> ERROR: A failure occurred in package().
Aborting...
volker ~ Documents cloned newsflash
< rm -r "$pkgdir"/build
---
>
> # Prior to 3ce5d93, news_flash_gtk/data/meson.build contained the lines
> # gnome.compile_resources(
> # 'symbolic_icons',
> # 'symbolic_icons.gresource.xml',
> # gresource_bundle: true,
> # source_dir: meson.current_build_dir(),
> # install: true,
> # install_dir: join_paths(meson.source_root(), 'data/resources/gresource_bundles')
> # )
> # gnome.compile_resources(
> # 'ui_templates',
> # 'ui_templates.gresource.xml',
> # gresource_bundle: true,
> # source_dir: meson.current_build_dir(),
> # install: true,
> # install_dir: join_paths(meson.source_root(), 'data/resources/gresource_bundles')
> # )
> # This causes symbolic_icons.gresource and ui_templates.gresource to be
> # installed in a weird path, the concatination of DESTDIR and the absolute
> # path of *.gresource in the source tree, to be exact. We want neither those
> # *.gresource files, nor their parent directories in our package. Properly
> # finding those parent directories is tricky so we just delete everything
> # except the usr folder. If you would build this package inside your /usr
> # folder, this would create problems, but I am too lazy to fix that and the
> # next release of news_flash_gtk will contain a fix anyway.
>
> # If e.g. you build the package in a clean chroot, as
> # explained in
> # https://wiki.archlinux.org/title/DeveloperWiki:Building_in_a_clean_chroot ,
> # this will delete "$pkgdir"/build which is
> # /build/newsflash/pkg/newsflash/build . If you e.g. build this package in
> # your home directory, this will delete
> # /home/username/pkgbuildfolder/pkg/newsflash/home
> # which contains
> # /home/username/pkgbuildfolder/pkg/newsflash/home/username/pkgbuildfolder/src/news_flash_gtk-1.5.1/data/resources/gresource_bundles/symbolic_icons.gresource
>
> find "$pkgdir" -mindepth 1 -maxdepth 1 ! -name 'usr' -exec rm -r {} +
Version Trouble
Now, let’s run the newest master version of news_flash_gtk: On this version, rendering does not work at all; the article is simply one big white space. (There is no text at all, not even white text, as can be confirmed by attempting to highlight it.) git bisect does not work nicely because there are quite a lot of commits where the build is broken, but in the end I managed to find the following:
- Commit 21f76dc552d419c447ad82edcd295a49673a1e18 works perfectly. Even the original bug of white blockquotes is gone.
- Commit 87696ab2631d90aa1c5cffeb7ec2317668945fc7 does not render anything at all.
Did I say it does not render anything? It does not render anything on my desktop, but it works perfectly on my laptop. Is there a single kind of bug we did not encounter in this yak-shaving? Now we have a fucking platform-dependent bug! My desktop and laptop are nearly identical Arch Linux machines.
Turns out that this is a known issue with WebKit and Nvidia GPUs.
Finishing Up
This story ends here. As I said in the abstract, nothing interesting, just bug-fixing.
Update
I might have accidentally found the reason for the dwarf errors: I installed the gcc package and the binutils package using the package manager. Then, unrelated to the events of this blog post, I installed something (I think it was a gcc) using make install. I then deleted some, but not all of the binaries installed by make install. The end result was that gcc, objdump, nm, ar, ld … found by which $name
were incompatible versions. I am not sure if this happened before or after the events of this blogpost. I am too lazy to check if building WebKit with debug symbols now works. If you too struggle with dwarf errors on Arch Linux, this script might help you:
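# Requires bash (read -d ''). Refresh pacman's file database first.
sudo pacman -Fy
# Read NUL-separated paths so that file names containing spaces survive.
find /usr -type f -print0 | while IFS= read -r -d '' el ; do
# Empty output is taken to mean that no package claims the file.
if [ -z "$(pacman -F "$el")" ] ; then
echo "$el was not installed by the package manager. Consider removing it."
fi
done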
The same thing presumably led to the linking problems in Rust.