Bug 548600

Summary: grub crashes/reboots before menu
Product: [openSUSE] openSUSE 11.2 Reporter: Harald Koenig <koenig>
Component: Bootloader Assignee: Torsten Duwe <duwe>
Status: RESOLVED FIXED QA Contact: Jiri Srain <jsrain>
Severity: Critical    
Priority: P2 - High CC: coolo, martin
Version: Final   
Target Milestone: ---   
Hardware: x86-64   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: y2log

Description Harald Koenig 2009-10-21 00:03:48 UTC
Lenovo T61p notebook:  after update from 11.1 to 11.2 RC1 via DVD, grub crashes and reboots before showing the boot menu.

I've tried to re-install grub via the rescue system.  grub-install.unsupported worked without error, but grub still breaks.

now I'm using grub from 11.1 (still installed in another partition) to boot 11.2 RC1 :-(
Comment 1 Stephan Kulow 2009-10-21 09:24:14 UTC
I wonder why it would fail for you and work for everyone else.
Comment 2 Torsten Duwe 2009-10-21 09:31:34 UTC
Harald, come on, give us some details!
http://en.opensuse.org/Bugs/grub

Maybe some partitioning problem? Or the graphics?
Comment 3 Harald Koenig 2009-10-21 10:00:33 UTC
ok, more details:

the notebook boots fine with 11.1 (and before 11.0)

I just created a new boot partition and LV for root with copies from the 11.1 system for the update/RC1 test. the system did boot on the 11.1 "clone" before updating.

system boots again after grub-install from 11.1.   this all does not look like typical configuration problems.  but who knows...
anyway, there should not be any reason for grub just to reset/reboot and not show a useful error message and stop/wait.


# /sbin/fdisk -l /dev/sda

Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x4d15b664

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1       52967   425457396    5  Extended
/dev/sda2           52968       54273    10490445   83  Linux
/dev/sda3           54274       55579    10490445   83  Linux
/dev/sda4           55580       60801    41945715    7  HPFS/NTFS
/dev/sda5               1         132     1060227   83  Linux
/dev/sda6             133         264     1060258+  83  Linux
/dev/sda7             265         918     5253223+  82  Linux swap / Solaris
/dev/sda8             919       40082   314584798+  8e  Linux LVM
/dev/sda9           40083       52967   103498731   83  Linux

# cat /boot/grub/device.map
(hd0)   /dev/sda

# grub-install.unsupported /dev/sda
Installation finished. No error reported.
This is the contents of the device map /boot/grub/device.map.
Check if this is correct or not. If any of the lines is incorrect,
fix it and re-run the script `grub-install'.

(hd0)   /dev/sda


# grub-install


    GNU GRUB  version 0.97  (640K lower / 3072K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename. ]
grub> setup --stage2=/boot/grub/stage2 (hd0) (hd0,5)
 Checking if "/boot/grub/stage1" exists... yes
 Checking if "/boot/grub/stage2" exists... yes
 Checking if "/boot/grub/e2fs_stage1_5" exists... yes
 Running "embed /boot/grub/e2fs_stage1_5 (hd0)"...  20 sectors are embedded.
succeeded
 Running "install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) (hd0)1+20 p (hd0,5)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
Done.
grub> quit



more details after lunch (which ones ???)
Comment 4 Torsten Duwe 2009-10-21 10:41:59 UTC
/dev/sda1   *           1       52967   425457396    5  Extended

You're purposely asking for trouble here.
Comment 5 Harald Koenig 2009-10-21 12:26:07 UTC
(In reply to comment #4)
> /dev/sda1   *           1       52967   425457396    5  Extended
> 
> You're purposely asking for trouble here.

unfortunately it's not that easy :-(   
[and since when does GRUB in MBR care about the active-partition flag... ?!?]


ok, now it looks like this:

# /sbin/fdisk -l /dev/sda

Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x4d15b664

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1       52967   425457396    5  Extended
/dev/sda2           52968       54273    10490445   83  Linux
/dev/sda3           54274       55579    10490445   83  Linux
/dev/sda4           55580       60801    41945715    7  HPFS/NTFS
/dev/sda5               1         132     1060227   83  Linux
/dev/sda6   *         133         264     1060258+  83  Linux
/dev/sda7             265         918     5253223+  82  Linux swap / Solaris
/dev/sda8             919       40082   314584798+  8e  Linux LVM
/dev/sda9           40083       52967   103498731   83  Linux

better?   not really.... so let's try all possible combinations:  using

11.2 grub-install on 11.2 root/boot  grub fails/reboots
11.2 grub-install on 11.1 root/boot  works!

11.1 grub-install on 11.2 root/boot  grub fails/reboots too!
11.1 grub-install on 11.1 root/boot  still works (phew;)


so to me it looks like it's the contents of /dev/sda6 (boot2 for 11.2) ?!

in "fail/reboot mode" grub displays

   GRUB loading stage 1.5   
   GRUB loading, please wait ...

and ~1 second later the screen goes blank and reboot starts over


I've compared the *stage* files in /boot/grub/ and /usr/lib/grub/ and only the file "stage2" differs like this (which is similar to the 11.1 installation where changes in stage2 go from byte 497-550) :

# cmp -l  /usr/lib/grub/stage2 /boot/grub/stage2
   523 377   4
   537 142 147
   538 157 162
   539 157 165
   540 164 142
   542 147 155
   543 162 145
   544 165 156
   545 142 165
   546  57  56
   547 155 154
   548 145 163
   549 156 164
   550 165   0
   668 116  66
   669  77  25
   673 234 204
   674 360 306
   685 354 240
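
The octal columns above are easier to judge once they are turned back into characters. A minimal Python sketch, assuming the usual `cmp -l` layout (1-based decimal offset, then the byte from the first file and the byte from the second file, both in octal). The helper names `parse_cmp` and `render` are made up for illustration, and only bytes 537-550 from the listing are used:

```python
# Sketch: decode the octal byte values printed by `cmp -l` into characters.
# Each line is "<1-based offset> <octal byte in file 1> <octal byte in file 2>".
CMP_OUTPUT = """\
   537 142 147
   538 157 162
   539 157 165
   540 164 142
   542 147 155
   543 162 145
   544 165 156
   545 142 165
   546  57  56
   547 155 154
   548 145 163
   549 156 164
   550 165   0
"""


def parse_cmp(text):
    """Return {offset: (byte_in_file1, byte_in_file2)} from `cmp -l` output."""
    diffs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        offset, left, right = line.split()
        diffs[int(offset)] = (int(left, 8), int(right, 8))
    return diffs


def render(diffs, side):
    """Render one side as text.

    '?' marks offsets that cmp did not list (identical in both files, so
    their value is unknown from the diff alone); a NUL byte is shown as the
    two characters backslash-zero; other non-printable bytes become '.'.
    """
    out = []
    for offset in range(min(diffs), max(diffs) + 1):
        if offset not in diffs:
            out.append("?")
            continue
        byte = diffs[offset][side]
        if byte == 0:
            out.append("\\0")
        elif 32 <= byte < 127:
            out.append(chr(byte))
        else:
            out.append(".")
    return "".join(out)


diffs = parse_cmp(CMP_OUTPUT)
print("file 1 (/usr/lib/grub/stage2):", render(diffs, 0))
print("file 2 (/boot/grub/stage2):  ", render(diffs, 1))
```

For these bytes the first file reads `boot?grub/menu` and the second `grub?menu.lst\0`. That looks like an embedded menu path being rewritten rather than random corruption, though this is only an interpretation of the diff. The remaining differences (523, 668-685) do not decode to text.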


any more ideas what to test/replace/... ?

any grub hooks to get more verbose output and not immediately blank/reboot ?

the end of sda6 (boot for 11.2) ends at byte 2171473920 which is slightly above 2 GB -- any chance that there might be a 2GB limit for some broken BIOS stuff  while loading stage2 ?
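
The 2 GB figure can be re-derived from the fdisk output. A small Python sketch using only the geometry and the "End" cylinders printed by fdisk above; "2 GB" is assumed here to mean 2 GiB (2^31 bytes), which may or may not be the limit a BIOS would actually impose:

```python
# Geometry as reported by fdisk: 255 heads, 63 sectors/track, 512-byte sectors.
HEADS, SECTORS_PER_TRACK, SECTOR_BYTES = 255, 63, 512
CYLINDER_BYTES = HEADS * SECTORS_PER_TRACK * SECTOR_BYTES  # fdisk's 16065 * 512

TWO_GIB = 2 * 1024**3  # assumption: the suspected limit is 2^31 bytes

# "End" cylinder of each boot partition, taken from the fdisk listing.
END_CYLINDER = {"sda5 (boot for 11.1)": 132, "sda6 (boot for 11.2)": 264}

for name, cylinder in END_CYLINDER.items():
    end_byte = cylinder * CYLINDER_BYTES
    print(name, "ends at byte", end_byte,
          "-> beyond 2 GiB" if end_byte > TWO_GIB else "-> below 2 GiB")
```

With these numbers sda5 ends at byte 1085736960, well below 2 GiB, while sda6 ends at byte 2171473920, about 23 MB past it. So only the 11.2 boot partition crosses that boundary, which fits the suspicion but does not prove it.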

I could exchange contents of sda5/sda6 for a test if necessary (but takes more time:(


thanks!
Comment 6 Torsten Duwe 2009-10-21 12:59:04 UTC
Replace the first 100M on your disk with a general boot selector; like a master grub or such. Many colleagues here with multiple installations have done so and would never want to go back again, because it just works so well.

Your setup is hand-made, fragile and hence unsupported. It is probably unrelated to your problem, but do us all a favour and fix that beforehand, nonetheless.

If you can track this down to a coding error and come up with a patch, presumably to yast, we'll be happy to include it. But starting the disk with a big 0x05-type extended is extremely silly, IMHO. I won't even comment on the LVM _inside_ that.

BTW, there is a reason why stage2 is _copied_ to /boot/grub/.
Comment 7 Harald Koenig 2009-10-21 14:06:25 UTC
(In reply to comment #6)
> Replace the first 100M on your disk with a general boot selector; like a master
> grub or such. Many colleagues here with multiple installations have done so and
> would never want to go back again, because it just works so well.

that was exactly my partition setup (for exactly those reasons;) until a recent small change: 100M is not enough for multiple kernels (and maybe even one rescue image to be booted directly from disk -- very handy at times!), so I changed from 2*100M boot partitions to 2*1G boot partitions with a new larger disk.  and I still don't want to believe that a 1 year old Thinkpad has issues booting beyond 2GB -- but I'll see...

> Your setup is hand-made, fragile and hence unsupported. It is probably
> unrelated to your problem, but do us all a favour and fix that beforehand,
> nonetheless.

come on, the boot partitions are at the very start of the disk -- no chance to get that any better if there is the constraint that each boot partition shall be 1 GB.


> If you can track this down to a coding error and come up with a patch,
> presumably to yast, we'll be happy to include it. But starting the disk with a
> big 0x05-type extended is extremely silly, IMHO. 

please have a 2nd look at the start sectors of sda5/sda6 !

> I won't even comment on the LVM _inside_ that. 

why? please do so!  I'd like to learn about real problems/issues (not just fud) and see better ways to go...

as long as there are only 4 primary partitions plus at least one crappy Windows and two boot partitions (plus ...), there are not too many possibilities and at least until now I can't see any indication that the layout choice is bad (modulo 2GB;)


> BTW, there is a reason why stage2 is _copied_ to /boot/grub/.

I have no idea what exactly grub-install and esp. "/sbin/yast2 bootloader" do. I do not change/touch /boot/grub/stage2 myself!

here is an unedited screen log, which shows that both grub-install and grub-install.unsupported install a modified stage2 (as I see on the working 11.1 boot partition!) but those two versions of stage2 are not identical either:


harald harald > md5sum /boot/grub/stage2
abed327ef0a9cb8c683f4a0db6f12d36  /boot/grub/stage2
harald harald > md5sum /usr/lib/grub/stage2
7d5cc2de0f8b78c00b5003cf4a4f10a8  /usr/lib/grub/stage2
harald harald > grub-install


    GNU GRUB  version 0.97  (640K lower / 3072K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename. ]
grub> setup --stage2=/boot/grub/stage2 (hd0) (hd0,5)
 Checking if "/boot/grub/stage1" exists... yes
 Checking if "/boot/grub/stage2" exists... yes
 Checking if "/boot/grub/e2fs_stage1_5" exists... yes
 Running "embed /boot/grub/e2fs_stage1_5 (hd0)"...  20 sectors are embedded.
succeeded
 Running "install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) (hd0)1+20 p (hd0,5)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
Done.
grub> quit

harald harald > md5sum /boot/grub/stage2
eecb4a5bcf736ca85a298cb7dee329b7  /boot/grub/stage2
harald harald > md5sum /usr/lib/grub/stage2
7d5cc2de0f8b78c00b5003cf4a4f10a8  /usr/lib/grub/stage2

harald harald > grub-install.unsupported /dev/sda
Installation finished. No error reported.
This is the contents of the device map /boot/grub/device.map.
Check if this is correct or not. If any of the lines is incorrect,
fix it and re-run the script `grub-install'.

(hd0)   /dev/sda
harald harald > md5sum /boot/grub/stage2
abed327ef0a9cb8c683f4a0db6f12d36  /boot/grub/stage2
harald harald > md5sum /usr/lib/grub/stage2
7d5cc2de0f8b78c00b5003cf4a4f10a8  /usr/lib/grub/stage2


multiple runs of grub-install* always produce identical md5sums, so the differences are at least not just a time stamp!

here are the diffs:

# cmp -l /boot/grub/stage2.grub-install /boot/grub/stage2.grub-install.unsupported 
   537 142 147
   538 157 162
   539 157 165
   540 164 142
   542 147 155
   543 162 145
   544 165 156
   545 142 165
   546  57  56
   547 155 154
   548 145 163
   549 156 164
   550 165   0

# cmp -l /usr/lib/grub/stage2 /boot/grub/stage2.grub-install.unsupported
   523 377   4
   537 142 147
   538 157 162
   539 157 165
   540 164 142
   542 147 155
   543 162 145
   544 165 156
   545 142 165
   546  57  56
   547 155 154
   548 145 163
   549 156 164
   550 165   0
   668 116  66
   669  77  25
   673 234 204
   674 360 306
   685 354 240

# cmp -l /usr/lib/grub/stage2 /boot/grub/stage2.grub-install
   523 377   4
   668 116  66
   669  77  25
   673 234 204
   674 360 306
   685 354 240
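
One way to read the three listings together is to check whether the differences simply add up. A hedged Python sketch that uses nothing but the offsets and octal values pasted above; `parse_cmp` and the variable names are illustrative only:

```python
# Sketch: check how the three `cmp -l` listings above relate to each other.

def parse_cmp(text):
    """Map 1-based offset -> (octal byte in file 1, octal byte in file 2)."""
    diffs = {}
    for line in text.splitlines():
        if line.strip():
            offset, left, right = line.split()
            diffs[int(offset)] = (int(left, 8), int(right, 8))
    return diffs


# stage2.grub-install vs stage2.grub-install.unsupported
INSTALL_VS_UNSUPPORTED = parse_cmp("""
   537 142 147
   538 157 162
   539 157 165
   540 164 142
   542 147 155
   543 162 145
   544 165 156
   545 142 165
   546  57  56
   547 155 154
   548 145 163
   549 156 164
   550 165   0
""")

# /usr/lib/grub/stage2 vs stage2.grub-install
PRISTINE_VS_INSTALL = parse_cmp("""
   523 377   4
   668 116  66
   669  77  25
   673 234 204
   674 360 306
   685 354 240
""")

# /usr/lib/grub/stage2 vs stage2.grub-install.unsupported
PRISTINE_VS_UNSUPPORTED = parse_cmp("""
   523 377   4
   537 142 147
   538 157 162
   539 157 165
   540 164 142
   542 147 155
   543 162 145
   544 165 156
   545 142 165
   546  57  56
   547 155 154
   548 145 163
   549 156 164
   550 165   0
   668 116  66
   669  77  25
   673 234 204
   674 360 306
   685 354 240
""")

path_bytes = set(INSTALL_VS_UNSUPPORTED)   # offsets 537-550
other_bytes = set(PRISTINE_VS_INSTALL)     # offsets 523 and 668-685

# Do the two smaller diffs together account for the big one, without overlap?
print("union matches:", path_bytes | other_bytes == set(PRISTINE_VS_UNSUPPORTED))
print("overlap:", sorted(path_bytes & other_bytes))
# Are the six non-path bytes changed identically by both installers?
print("same values:", all(PRISTINE_VS_UNSUPPORTED[o] == PRISTINE_VS_INSTALL[o]
                          for o in other_bytes))
```

On the pasted data the union matches, the overlap is empty, and the six non-path bytes are changed to the same values by both tools. In other words, the two installers agree on bytes 523 and 668-685 and differ only in the path-like bytes 537-550. That narrows the suspects to those six bytes, but it does not say which of them breaks the boot.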


so next test -- in the rescue system after a grub-reset-loop I just did this on the sda6 boot partition:

# cp /usr/lib/grub/stage2 /boot/grub/stage2
# reboot

and INDEED!!!!!  grub boots into the menu


now the 1M question:  who modifies /boot/grub/stage2 ???

time for you and/or strace (me later;)

PS: do you see any interaction between my "strange" setup and those magic changes in the "copied" stage2 ?

any hints what to test first ?
Comment 8 Torsten Duwe 2009-10-21 16:00:46 UTC
What file system do you use? ext2, 3, or 4?

And you don't need to strace ;-) This bug looks similar to a dreadful XFS oddity we've already had.
Comment 9 Harald Koenig 2009-10-21 18:23:57 UTC
(In reply to comment #8)
> What file system do you use? ext2, 3, or 4?
> 
> And you don't need to strace ;-) This bug looks similar to a dreadful XFS
> oddity we've already had.

it's all ext3 -- no xfs anymore on my notebook;)

why not strace ?  to my surprise at a first glance, strace shows that the yast-edition grub-install does not access/read /usr/lib/grub/stage2 at all, it only reads and modifies /boot/grub/stage2.   haven't looked into the details yet...  

e.g. I'll have a look what happens when there is no stage2 in /boot/...
Comment 10 Harald Koenig 2009-11-15 11:27:11 UTC
I created another copy of my working 11.1 system and did a "clean" update to 11.2 (booted from NET install CD, repo from local disk) -- and it happened again: grub stage2 again was broken and grub crashed/rebooted before the menu!  after copying a clean stage2 to /boot/grub/ the system does boot.

these are the "wrong" bytes:

grub > cmp -l stage2.bad stage2
   523   4 377
   668  66 116
   669  25  77
   673 204 234
   674 306 360
   685 240 354

I'll attach y2logs...
Comment 11 Harald Koenig 2009-11-15 11:30:38 UTC
Created attachment 327588 [details]
y2log
Comment 12 Torsten Duwe 2010-02-11 16:45:18 UTC
Assumed fixed by latest OBS check-in that builds with known compiler.
Comment 13 Harald Koenig 2010-06-12 13:15:59 UTC
(In reply to comment #12)
> Assumed fixed by latest OBS check-in that builds with known compiler.

greetings from LinuxTag in Berlin -- it just happened again while trying to update a friend's notebook (trying to boot after update to 11.2 with retail DVD 64bit)!!!

so which "latest OBS check-in"s shall I look for ?

this is the output of "cmp -l stage2.bad stage2" in /boot/grub after "fixing": 

   497  44   0
   498 327   0
   499  32   0
   500   2   0
   501 262   0
   504  13   0
   505  13   2
   506 327   0
   507  32   0
   508   2   0
   509  27 337
   523   4 377
   668 342 116
   669  22  77
   673 134 234
   674 304 360
   685 114 354
Comment 14 Torsten Duwe 2010-06-16 10:28:13 UTC
Harald, please stop reopening this bug. 11.2 is done and I'm quite sure this issue was fixed. export CC="gcc-4.1" made the register allocation deterministic, so BIOS bugs are no longer a moving target, and most of them are not affecting grub any more. You may have run into something that shows the same symptom, so please open a new bug against 11.3, considering http://en.opensuse.org/Bugs/grub

And please stop pasting stage2 diffs. This works as designed.
Comment 15 Martin Schröder 2010-06-16 10:58:59 UTC
(In reply to comment #14)
> Harald, please stop reopening this bug. 11.2 is done and I'm quite sure this
> issue was fixed. export CC="gcc-4.1" made the register allocation

Are you telling me that installing 11.2 is not supported anymore?
Comment 16 Torsten Duwe 2010-06-16 11:17:12 UTC
No, it's only hard to retrofit installation media already manufactured ;)

If this problem was common, those media wouldn't have been released.

BTW, rereading this bug, Comments #1, #4 and #9 strike me.

Care to open an 11.3 bug with partitioning data from that friend's notebook?
Comment 17 Martin Schröder 2010-06-16 11:21:23 UTC
(In reply to comment #16)
> Care to open an 11.3 bug with partitioning data from that friend's notebook?

Since I'm that friend: What do you need? fdisk -l?
Comment 18 Harald Koenig 2010-06-16 12:34:33 UTC
Hi Torsten,

(In reply to comment #14)
> Harald, please stop reopening this bug. 11.2 is done and I'm quite sure this
> issue was fixed. export CC="gcc-4.1" made the register allocation
> deterministic, 

but there is not a single grub update package available for 11.2, so that
"fix" does not really "exist" -- no chance to download a fixed/working grub right now...

in comment #12 you write "Assumed fixed by latest OBS check-in that builds with known compiler."
so where/how in OBS can I find this "fixed" grub ?  at LinuxTag a kind openSUSE guy from Prague
tried to help us and looked into OBS after reading your comments in this ticket -- he wasn't able
to find any newer grub either (Martin, can you remember his name? tall guy, curly black hair).
pls have a look:

	http://software.opensuse.org/search?baseproject=openSUSE%3A11.2&exclude_debuginfo=true&exclude_filter=home%3A&p=1&q=grub

what's the point of resolving/closing bugs but never releasing any updates ?


> so BIOS bugs are no longer a moving target, and most of them are
> not affecting grub any more. You may have run into something that shows the
> same symptom, so please open a new bug against 11.3, considering
> http://en.opensuse.org/Bugs/grub
> 
> And please stop pasting stage2 diffs. This works as designed.

what exactly "works as designed" ?  the grub-copied stage2 does not work/boot
while the raw-copied stage2 works.  so what's the design idea of those few changed bytes?

> No, it's only hard to retrofit installation media already manufactured ;)

but it's not so hard to release an updated rpm, is it ?


> BTW, rereading this bug, Comments #1, #4 and #9 strike me.

yes, me too ;-)

#1: no idea, but also not my job;)   now I know of 2 out of 2 11.2 installations -- both fail that way...
   yes. I'm surprised about those statistics too!

#4: can you elaborate on that ? there is no technical reason why the "first sectors" of a disk should be any better or worse for an extended partition (except for sustained throughput -- which is the very well thought-out reason for that location btw).
other than that, partitions are just blocks of sectors (== a tuple of start/end or start/size values), no matter if it's a primary or extended partition.  your comment without any explanation or reasoning strikes me;)
and pls remember that grub boot loader got written to MBR, so the active flag etc. does not matter at all...

what about #9 ?



final line: likely with a fixed grub update package everything would be fine (this reopen never was about fixing an existing physical DVD;)