An 11-Year-Old Bug in macOS popen()

Recently I discovered a bug in the function popen() in macOS’s Libc that results in a segmentation fault. It created lots of headaches during development and testing and I wanted to share this with the community.

How it started 

In a nutshell, the NetBeez Work-From-Home agent runs on Windows, macOS, and Linux and collects information from the host system that captures the user’s network performance experience. Among other things, it calls system utilities such as ping, traceroute, and iPerf to collect network performance metrics. It’s a multithreaded application that spawns these utilities in parallel and independently of each other using fork() or popen().

We initially developed the agent on Linux and then ported it to Windows and macOS. We noticed that only on macOS it would crash from time to time, and not deterministically. If we let it run long enough (e.g. a whole day), a crash was inevitable. Those are the most fun bugs to go after, right?

Running it under lldb and examining the call stack showed that it was crashing in popen():

(lldb) bt

* thread #97, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x00007ff803997d22 libsystem_c.dylib`popen + 478
    frame #1: 0x0000000100050c14 nbagentgdb`runHTTPCommand(lookup=0x00007000104d9e08, connect=0x00007000104d9e00, appconnect=0x00007000104d9df8, pretransfer=0x00007000104d9df0, redirect=0x00007000104d9de8, starttransfer=0x00007000104d9de0, total=0x00007000104d9dd8, error=0x00007000104d9dd4, errorMsg="", httpCommand="curl --user-agent \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36\" -k --fail --silent --show-error --output /dev/null -w\"1: %{time_namelookup}\n2: %{time_connect}\n3: %{time_appconnect}\n4: %{time_pretransfer}\n5: %{time_redirect}\n6: %{time_starttransfer}\n7: %{time_total}\n\"    'https://princeton.edu' --max-time 5   2>&1") at executeHTTP.c:322:15
    frame #2: 0x000000010004e6fc nbagentgdb`executeHTTP(param=0x0000000108238000) at executeHTTP.c:140:13
    frame #3: 0x00000001000157a6 nbagentgdb`executeTest(param=0x0000000108238000) at utilities.c:914:17
    frame #4: 0x00007ff803a79514 libsystem_pthread.dylib`_pthread_start + 125
    frame #5: 0x00007ff803a7502f libsystem_pthread.dylib`thread_start + 15

How it continued

It didn’t cross my mind that there could be something wrong with the macOS popen(). So I kept searching and debugging, looking at many other things. Once I hit a dead end, I decided to post the issue on Apple’s developer forum.

Soon enough I got a response from eskimo (thank you!), and after some back and forth they recommended looking at the popen() implementation in Apple’s open source software. Their recommendation was to get the popen.c implementation from GitHub, drop it into my code so I could debug it directly, and identify exactly what was causing the crash.

Within a few hours I identified that the crash was happening on the following line:

SLIST_FOREACH(p, &pidlist, next)
	(void)posix_spawn_file_actions_addclose(&file_actions, p->fd);

Reading the code, I noticed that the shared linked list is protected by a lock a few lines below, but not where the crash was happening, so one thread can walk the list while another thread is modifying it:

THREAD_LOCK();
SLIST_INSERT_HEAD(&pidlist, cur, next);
THREAD_UNLOCK();

Making the following change fixed the issue:

THREAD_LOCK();
SLIST_FOREACH(p, &pidlist, next)
	(void)posix_spawn_file_actions_addclose(&file_actions, p->fd);
THREAD_UNLOCK();
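
The same rule can be shown outside of Libc. The sketch below is a standalone illustration, not Apple's code: it assumes a pthread mutex in place of THREAD_LOCK()/THREAD_UNLOCK(), and the names entry, fdlist, fdlist_add(), fdlist_remove(), and fdlist_count() are hypothetical. The point is that every access to the shared list, including read-only traversal, takes the same lock.

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical list node, loosely modeled on an "fd per child" record. */
struct entry {
    int fd;
    SLIST_ENTRY(entry) next;
};

static SLIST_HEAD(fdlist_head, entry) fdlist = SLIST_HEAD_INITIALIZER(fdlist);
static pthread_mutex_t fdlist_lock = PTHREAD_MUTEX_INITIALIZER;

/* Insert under the lock. Returns 0 on success, -1 if allocation fails. */
int fdlist_add(int fd)
{
    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->fd = fd;

    pthread_mutex_lock(&fdlist_lock);
    SLIST_INSERT_HEAD(&fdlist, e, next);
    pthread_mutex_unlock(&fdlist_lock);
    return 0;
}

/* Remove under the lock. Returns 0 if the fd was found, -1 otherwise. */
int fdlist_remove(int fd)
{
    struct entry *e, *found = NULL;

    pthread_mutex_lock(&fdlist_lock);
    SLIST_FOREACH(e, &fdlist, next) {
        if (e->fd == fd) {
            found = e;
            break;
        }
    }
    if (found != NULL)
        SLIST_REMOVE(&fdlist, found, entry, next);
    pthread_mutex_unlock(&fdlist_lock);

    free(found);    /* free(NULL) is a no-op */
    return found != NULL ? 0 : -1;
}

/* Traversal also takes the lock: without it, a concurrent remove could
 * free the node this loop is about to dereference. */
int fdlist_count(void)
{
    struct entry *e;
    int n = 0;

    pthread_mutex_lock(&fdlist_lock);
    SLIST_FOREACH(e, &fdlist, next)
        n++;
    pthread_mutex_unlock(&fdlist_lock);
    return n;
}
```

If fdlist_count() skipped the lock, it would have the same shape as the unprotected SLIST_FOREACH in popen(): usually fine, occasionally following a pointer into freed memory.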

How it ended

I submitted a PR on GitHub to fix the issue, but it hasn’t received any attention. The current implementation of popen() was introduced in Libc-825.24, released in July 2012. The immediately preceding release, Libc-763.13, has a different implementation of popen() that looks like it locks the shared linked list. That implementation of popen() is the same as the one from HardenedBSD.

I looked at the current FreeBSD implementation of popen() and from what I see the shared linked list is properly protected by locks when accessed.
