So here is the story. On the server, I've set up dhcpd, bind/named, tftp, and drakpxe as a PXE boot server (network booting is usually enabled in the ROM/BIOS of the client). I recompiled a new Mageia3 kernel from source, changing nfs, dhcp, and nfsroot from modules to built-in kernel code. The end result is a bzImage that I put in the tftpboot directory. (I'll give all the hairy details to anyone who would like to hear them.) Bottom line is this part works perfectly. The PXE boot uses only that bzImage. It does not use an initrd! It shouldn't need to. So the boot works flawlessly.
For the diskless filesystem, I used info from Colin Guthrie's website: http://colin.guthr.ie/2012/09/nfs-root- ... ntre-v2-0/
Great resource btw. I do the following:
rpm --root /diskless --initdb
urpmi --root /diskless basesystem-minimal
urpmi --root /diskless kernel-server-latest locales-en nfs-utils bash-completion colorprompt openssh-server openssh-clients task-c-devel task-c++-devel
I'm going to leave out the changes made in the root filesystem to /diskless/etc/fstab, passwd, shadow, etc. These are your typical system file changes.
I do modify /diskless/
This part is KEY to my problem. In the server's /etc/exports, I have:
/diskless 10.0.0.0/24(ro,no_root_squash,no_subtree_check,async,insecure)
So /diskless should be mounted as a read-only NFS root filesystem. For a cluster, that is what I want. Here is the problem. When the cluster node BOOTS for the FIRST TIME, it goes through the whole [OK] startup sequence up to the point where it says "mageia3". It follows with a complaint about autofs4, then something like this (from memory, so not exact): systemd... assert (closed_fd_id() == 0) failed. systemd; src/sys/util.c 133. The next line is systemd: Freezing system.
I looked up the source code for the error, and it looks like systemd fails to close a file handle. It has taken 4 long days to isolate this, but if I change /etc/exports from ro to rw:
/diskless 10.0.0.0/24(rw,no_root_squash,no_subtree_check,async,insecure)
It boots! Without error! After booting that one first time, if I change it back to ro, it will now boot OK, but systemd will start barfing that same error message about assert (close ....... == 0) failed. This time it's annoying but not fatal.
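For anyone who wants to repeat the ro/rw flip, here is a rough sketch of how it can be scripted. The helper name set_export_mode is made up for illustration (it is not part of nfs-utils), and it assumes GNU sed and an export line whose option list starts with ro or rw, like the ones above.

```shell
#!/bin/sh
# Sketch: switch the access mode (ro/rw) of one export line in an exports file.
# set_export_mode is a hypothetical helper, not an nfs-utils command.
set_export_mode() {
    file="$1"   # exports file, e.g. /etc/exports
    dir="$2"    # exported path, e.g. /diskless
    mode="$3"   # ro or rw
    # Touch only the line that starts with the export path, and replace a
    # leading "(ro," or "(rw," in its option list with the requested mode.
    sed -i "\|^$dir[[:space:]]|s/(r[ow],/($mode,/" "$file"
}
```

After changing the real /etc/exports, the server has to re-read it (as root) with exportfs -ra before the next client boot.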
My takeaway is that something in the boot process changes the root filesystem on first boot. Someone has to have run into this before, and I was wondering if anyone has some idea where I can look. I suspect there is a bug in systemd, but has anyone been able to boot mageia3 with a read-only NFS root filesystem all the way to a login prompt?
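One way to narrow down what gets written on that first boot is to drop a marker file on the server before the first rw boot, then list everything under /diskless that is newer than the marker afterwards. This is only a sketch: the function name changed_since and the marker path are made up for illustration, and it relies on modification times, so it will not catch anything that resets its own timestamps.

```shell
#!/bin/sh
# Sketch: list files under a tree that were modified after a marker file.
# changed_since is a hypothetical helper; keep the marker OUTSIDE the tree.
changed_since() {
    root="$1"     # tree to inspect, e.g. /diskless
    marker="$2"   # marker file created before the first rw boot
    # -xdev stays on one filesystem; -newer compares modification times.
    find "$root" -xdev -newer "$marker" -print | sort
}
```

Usage on the server would be something like: touch /root/diskless.marker, boot the node once with rw, then run changed_since /diskless /root/diskless.marker. Directories show up in the list too when a file is created or removed inside them.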
