I have been using mdb(1), Solaris Modular
Debugger, to help diagnose an issue with vold(1M).
Shortly after startup vold core dumped. So I configured
the system to save the core file using coreadm(1M) and restarted
vold with some additional debug flags. When vold next crashed the
logs it created were not particularly helpful in this case, but
the core file told another story:
$ mdb /var/cores/vold:seth-5361:0-0:1094116587
Loading modules: [ libc.so.1 ld.so.1 ]
> ::status
debugging core file of vold (32-bit) from seth
file: /usr/sbin/vold
initial argv: /usr/sbin/vold -v -t -L 99
threading model: multi-threaded
status: process terminated by SIGSEGV (Segmentation Fault)
> ::regs
%g0 = 0x0000000000000000 %l0 = 0xfef50692
%g1 = 0x000000006e6c2cff %l1 = 0x01010101
%g2 = 0x0000000000007b34 %l2 = 0xff266b00 libc.so.1`_uberdata
%g3 = 0xffffffffffffffff %l3 = 0x00001084
%g4 = 0xfffffffffffff810 %l4 = 0x00001000
%g5 = 0x0000000000000000 %l5 = 0x00000000
%g6 = 0x0000000000000000 %l6 = 0x00000000
%g7 = 0x00000000ff139000 %l7 = 0x01000000
%o0 = 0x00000000ff1e1010 libc.so.1`strlcat %i0 = 0xff30dda8
%o1 = 0x0000000000000041 %i1 = 0xfef50680
%o2 = 0x00000000ff29f2dc libnsl.so.1`parse_default+0xec %i2 = 0xfef50620
%o3 = 0x00000000ff3ea108 ld.so.1`lml_main %i3 = 0x00000000
%o4 = 0x00000000ff3ec644 ld.so.1`dbg_mask %i4 = 0x0000ff00
%o5 = 0x0000000000000000 %i5 = 0x80808080
%o6 = 0x00000000fef4f948 %i6 = 0xfef505b8
%o7 = 0x00000000ff3b36e8 ld.so.1`elf_rtbndr+0x10 %i7 = 0xff2155d0
libc.so.1`snprintf+0x7c
%psr = 0xfe900000 impl=0xf ver=0xe icc=NzvC
ec=0 ef=0 pil=0 s=0 ps=0 et=0 cwp=0x0
%y = 0x00000000
%pc = 0xff21192c libc.so.1`_ndoprnt+4
%npc = 0xff211930 libc.so.1`_ndoprnt+8
%sp = 0xfef4f948
%fp = 0xfef505b8
%wim = 0x00000000
%tbr = 0x00000000
>
mdb has debugger commands, known as dcmds and
walkers: see the man page.
First off, the ::status dcmd shows us a summary of the
current target. A segmentation fault occurs when a process tries to
reference an area of memory (a segment) that has not been allocated
(mapped). The ::regs dcmd shows us the registers when the
fault occurred. The register %pc (Program Counter) points to the
instruction that caused the violation, so lets look at the
instructions around %pc
> <pc-8/4i
libc.so.1`___const_seg_900000505+0x168: unimp 0x808
save %sp, -0xc70, %sp
st %i3, [%sp + 0x6c]
mov %o7, %i5
> <pc ::dis -n 2
libc.so.1`___const_seg_900000505+0x168: unimp 0x808
libc.so.1`_ndoprnt: save %sp, -0xc70, %sp
libc.so.1`_ndoprnt+4: st %i3, [%sp + 0x6c]
libc.so.1`_ndoprnt+8: mov %o7, %i5
libc.so.1`_ndoprnt+0xc: clr %i3
>
Firstly I used the older adb command to disassemble instructions
around the PC, as SPARC instructions are always 4 words its easy to
rewind a little first. However mdb provides the dis dcmd
which nicely backs up a little for you and highlights the starting
address.
Anyhow, I digress… The instructions at the PC show register
%i3 being stored on the stack, %sp offset 0x6c. As the
fault is a SIGSEGV we can presume that %sp+0x6c is non-existent. To
confirm:
> <sp+0x6c=X
fef4f9b4
> <sp+0x6c/X
mdb: failed to read data from target: no mapping for address
0xfef4f9b4:
> ::mappings ! egrep 'fef|SIZE'
BASE LIMIT SIZE NAME
fef00000 fef02000 2000 /usr/lib/smedia/sm_scsi.so.1
fef12000 fef14000 2000 /usr/lib/smedia/sm_scsi.so.1
fef20000 fef22000 2000 /usr/lib/straddr.so.2
fef32000 fef34000 2000 /usr/lib/straddr.so.2
fef40000 fef48000 8000
fef50000 fef58000 8000
fef60000 fef64000 4000 /usr/lib/vold/dev_rmdisk.so.1
fef74000 fef76000 2000 /usr/lib/vold/dev_rmdisk.so.1
fef80000 fef84000 4000 /usr/lib/vold/dev_pcmem.so.1
fef94000 fef96000 2000 /usr/lib/vold/dev_pcmem.so.1
fefb0000 fefb2000 2000 /usr/lib/vold/dev_floppy.so.1
fefc2000 fefc4000 2000 /usr/lib/vold/dev_floppy.so.1
fefd6000 fefd8000 2000
fefe4000 fefe6000 2000
feff6000 feff8000 2000
>
<sp+0x6c=X calculates the address and displays it in
hex. <sp+0x6c/X calculates the address and attempts to
display the data from that location in hex.
Sure enough the stack pointer register seems to be in no mans land.
The mappings dcmd (output abbreviated by grep) confirms that.
So who set %sp to this value? Well that would have been the
save instruction just prior to this store instruction (see
disassemble above). Now the save instruction causes a new stack to be
popped on top (lowest address) of the current stack. The SPARC save
instruction places the old %sp in %o6 and the new
%sp in %i6 (in fact %sp is a pseudo for %i6).
So Why was my stack limited to only 0x8000 (32,768):
32k?Good question to which I thank Nobutomo Nakano for his
assistance in helping me dig deeper…
One clue was right at the very beginning threading model:
multi-threaded. Another is at the bottom (highest address) of the stack:
> ::stack
libc.so.1`_ndoprnt+4(ff30dda8, fef50680, fef50620, 0, ff00, 80808080)
libc.so.1`snprintf+0x7c(fef50e90, 800, ff30dda8, fef54150, 59048, f)
libnsl.so.1`parse_default+0x370(fef54150, 0, 694c0, 7e, 59048, 800)
libnsl.so.1`parse_path+0xc4(fef54150, fef516f0, 5a5a8, 694b8, 80, 3)
libnsl.so.1`__nis_getnames+0xfc(fef54150, fef549e4, 5a5a8, 694b8, 694b8,
ffbff415)
libnsl.so.1`nis_list+0x2a8(fef54a48, 10043, 0, 0, fef54a60, 1)
nss_nisplus.so.1`_nss_nisplus_expand_lookup+0x168(7ead8, fef54f78, fee86b3c,
fef575b9, fee86b44, 0)
nss_nisplus.so.1`getbyname+0x2c(7ead8, fef54f78, ff266b00, 15684, ff1d8364,
fee98000)
libc.so.1`nss_search+0x1d4(fee8296c, ff26573c, ff26a7cc, fef54f78, 1, ff262228)
libnsl.so.1`_switch_gethostbyname_r+0x50(fef575b9, 7c99c, 7c9b0, 2120, fef57088
, ff2b81d4)
libnsl.so.1`_door_gethostbyname_r+0x108(fef575b9, 7c99c, 7c9b0, 2120, fef57088,
fef575b9)
libnsl.so.1`_get_hostserv_inetnetdir_byname+0xac0(2, fef57138, fef57130, 0,
fef570ac, 0)
libnsl.so.1`netdir_getbyname+0x84(56f48, fef571bc, fef571b8, ff295aa8, 760c0,
ff22e73c)
libnsl.so.1`_getclnthandle_timed+0x198(fef575b9, 56f48, fef5728c, ff314f00, 0, 0
)
libnsl.so.1`__rpcb_findaddr_timed+0x134(1873b, 1, 56f48, fef575b9, fef57304,
ff314f00)
libnsl.so.1`clnt_tp_create_timed+0x5c(fef575b9, 1873b, 56f48, 73cb0, 0, ff29cc90
)
libnsl.so.1`clnt_create_timed+0x198(fef575b9, 1873b, 1, 0, 0, 516a8)
libsmedia.so.1`is_server_running+0x38(58a28, 10ea0, 1234, ff3515cc, ff110580,
ff362000)
libsmedia.so.1`get_handle_from_fd+0x10c(17, 0, 58a28, 46008, d, ff362000)
dev_reset_symname+4(66760, 17, 666b0, 800017, 1c, 7)
dev_rmdisk.so.1`rmdisk_thread_wait+0x26c(66760, 666ec, 1, fef62a00, 8, 2)
libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>
A stack belonging to _lwp_start(): vold(1M) is threaded and
the first argument to thr_create(3C) is the stack address. Now we could use
truss(1M) or (for those using Solaris Express) dtrace(1M) to see
thr_create() being used. But we need not as mdb (bless its
cotton socks) has a some walkers to help:
> $l
5
> ::walk thread
1
2
4
5
> ::walk ulwps
ff138000
ff138400
ff138c00
ff139000
>
“$l” Displays the thread number. “::walk thread”
displays a list of valid threads. “::walk ulwps lists ulwp_t
pointers.
I can sense that you are not impressed! But wait… Look at the
man page to libthread(3LIB). Notice a man page for thr_stksegment(3C)
which states “The thr_stksegment() function returns, in its
stack_t argument, the address and size of the calling threads
stack.” meaning that its got to be stored (or calculated)
somewhere (or somehow). Well it would make sense to have that stored
somewhere in the ulwp_t structure… And it is! Thanks to the
walker ulwps we know where that is:
> ff139000 ::print ulwp_t ! grep stk
ul_stk = 0xfef50000
ul_stktop = 0xfef58000
ul_stksiz = 0x8000
>
Using print dcmd we can print the contents of the
ulwp_t data structure.
Again I have used grep to show you only what is important here.
Now remembering that SPARC stacks grow down, the end of the stack for
this LWP is at ul_stk = 0xfef50000, its size is ul_stksiz = 0x8000
(32k) and thus its start is at ul_stktop = 0xfef58000.
So the problem is that this program didn’t allow for the stack to
grow so large. Lets look at the stack and see what made it out
grow:
> ::stackregs 1 ! grep '('
fef4f948 libc.so.1`_ndoprnt+4(ff30dda8)
fef505b8 libc.so.1`snprintf+0x7c(fef50e90)
fef50630 libnsl.so.1`parse_default+0x370(fef54150)
fef51690 libnsl.so.1`parse_path+0xc4(fef54150)
fef540f0 libnsl.so.1`__nis_getnames+0xfc(fef54150)
fef54950 libnsl.so.1`nis_list+0x2a8(fef54a48)
fef549e8 nss_nisplus.so.1`_nss_nisplus_expand_lookup+0x168(7ead8)
fef54e48 nss_nisplus.so.1`getbyname+0x2c(7ead8)
fef54ea8 libc.so.1`nss_search+0x1d4(fee8296c)
fef54f18 libnsl.so.1`_switch_gethostbyname_r+0x50(fef575b9)
fef54fa8 libnsl.so.1`_door_gethostbyname_r+0x108(fef575b9)
fef57018 libnsl.so.1`_get_hostserv_inetnetdir_byname+0xac0(2)
fef570b0 libnsl.so.1`netdir_getbyname+0x84(56f48)
fef57150 libnsl.so.1`_getclnthandle_timed+0x198(fef575b9)
fef571d0 libnsl.so.1`__rpcb_findaddr_timed+0x134(1873b)
fef57298 libnsl.so.1`clnt_tp_create_timed+0x5c(fef575b9)
fef57308 libnsl.so.1`clnt_create_timed+0x198(fef575b9)
fef57398 libsmedia.so.1`is_server_running+0x38(58a28)
fef579c0 libsmedia.so.1`get_handle_from_fd+0x10c(17)
fef57bb0 dev_reset_symname+4(66760)
fef57c28 dev_rmdisk.so.1`rmdisk_thread_wait+0x26c(66760)
fef57fa0 libc.so.1`_lwp_start(0)
>
Here I have used stackregs dcmd as it prints out the stack
pointer at the beginning of each frame. Stackregs provides more
information than I need here, so again I have trimmed it with grep.
To work out where the space went one simply needs to calculate the
differences between the addresses (stack/frame pointers) down the
left.
But hang on a minute. This is mdb(1m), we can copy /usr/demo/mdb
and write our own module to do the calculation for us….
> >fp ::walk simple_stack | ::stack_size
FRAME bytes kilobyte Owner
fef505b8 120 0 0
fef50630 4192 4 libnsl.so.1`parse_default+0x370
fef51690 10848 11 libnsl.so.1`parse_path+0xc4
fef540f0 2144 2 libnsl.so.1`__nis_getnames+0xfc
fef54950 152 0 libnsl.so.1`nis_list+0x2a8
fef549e8 1120 1 nss_nisplus.so.1`_nss_nisplus_expand_lookup+0x168
fef54e48 96 0 nss_nisplus.so.1`getbyname+0x2c
fef54ea8 112 0 libc.so.1`nss_search+0x1d4
fef54f18 144 0 libnsl.so.1`_switch_gethostbyname_r+0x50
fef54fa8 8304 8 libnsl.so.1`_door_gethostbyname_r+0x108
fef57018 152 0 libnsl.so.1`_get_hostserv_inetnetdir_byname+0xac0
fef570b0 160 0 libnsl.so.1`netdir_getbyname+0x84
fef57150 128 0 libnsl.so.1`_getclnthandle_timed+0x198
fef571d0 200 0 libnsl.so.1`__rpcb_findaddr_timed+0x134
fef57298 112 0 libnsl.so.1`clnt_tp_create_timed+0x5c
fef57308 144 0 libnsl.so.1`clnt_create_timed+0x198
fef57398 1576 2 libsmedia.so.1`is_server_running+0x38
fef579c0 496 0 libsmedia.so.1`get_handle_from_fd+0x10c
fef57bb0 120 0 dev_reset_symname+4
fef57c28 888 1 dev_rmdisk.so.1`rmdisk_thread_wait+0x26c
fef57fa0 0 0 libc.so.1`_lwp_start
>
The Modular nature of MDB makes it easy to add your own custom
dcmds and walkers. Note: kilobyte column is rounded to nearest K.
In short the fix was to provide a larger stack for this thread.
I hope this little walk through of mdb has been of some interest if
not educational to some.
Stace