Bug 1018262

Summary: Installation failure "cpio: rename" PowerPC multipath openQA test
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Michel Normand <normand>
Component: Kernel
Assignee: Michal Suchanek <msuchanek>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P2 - High
CC: aschnell, fdmanana, mchang, msuchanek, okurz
Version: Current
Target Milestone: ---
Hardware: PowerPC-64
OS: Other
URL: http://openqa.opensuse.org/tests/329391/modules/install_and_reboot/steps/21
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: install_and_reboot-y2logs.tar.bz2

Description Michel Normand 2017-01-05 08:46:13 UTC
Created attachment 708670
install_and_reboot-y2logs.tar.bz2

This bug is a follow-up to boo#1009472, opened
to continue investigating the same error after the worker update.
I am reusing the same summary:
Installation failure "cpio: rename" PowerPC multipath openQA test

As explained below, I need help to continue investigating this problem.

[Build 20170104] openQA test fails in install_and_reboot
## Observation

openQA test in scenario opensuse-Tumbleweed-DVD-ppc64le-install_only_ppc@ppc64le-multipath fails in
[install_and_reboot](http://openqa.opensuse.org/tests/329391/modules/install_and_reboot/steps/21)

## Reproducible

Fails since (at least) Build [20161110](http://openqa.opensuse.org/tests/303570)

## Expected result

Last good: [20161107](http://openqa.opensuse.org/tests/303068) (or more recent)

## Further details

Always latest result in this scenario: [latest](http://openqa.opensuse.org/tests/latest?flavor=DVD&arch=ppc64le&version=Tumbleweed&test=install_only_ppc&distri=opensuse&machine=ppc64le-multipath)



Below I repeat the status from
https://bugzilla.suse.com/show_bug.cgi?id=1009472#c15

I need suggestions on how to continue the investigation from this status.

Current_Status:
* The failure is specific to the disk multipath test with btrfs on TW PowerPC;
  the error reported in y2log is "cpio: rename".
* No failure for Leap 42.2.
* Unable to recreate the failure outside the openQA environment.
* The failure does not occur with ext4 in place of btrfs.
* The error reported by YaST is a package installation failure (on any package),
  and y2log reports a "cpio: rename" error with no error number.
* The "cpio: rename" string comes from an error in the fsmRename function in lib/fsm.c:
  It is reported by rpm via the zypp traces from libzypp
  (ExternalProgram.cc, Exception.cc, RpmDb.cc).
  The last error is reported by the rpmpsmUnpack function in rpm's psm.c, as an error returned by rpmPackageFilesInstall.
  The message is taken from emsg (the output of rpmfileStrerror);
  rpmfileStrerror builds the string "cpio: rename" by decoding the RPMERR_RENAME_FAILED return code.
  Summary of related source lines:
===
./rpm-4.12.0.1/lib/psm.c:671: fsmrc = rpmPackageFilesInstall(psm->ts, psm->te, psm->files,
===
    fsmrc = rpmPackageFilesInstall(psm->ts, psm->te, psm->files, psm, &failedFile);
    emsg = rpmfileStrerror(fsmrc);
    rpmlog(RPMLOG_ERR,
            _("unpacking of archive failed%s%s: %s\n"),          
            (failedFile != NULL ? _(" on file ") : ""),
            (failedFile != NULL ? failedFile : ""),
            emsg);                  
===
./rpm-4.12.0.1/lib/rpmfi.c:2111:char * rpmfileStrerror(int rc)
./rpm-4.12.0.1/lib/fsm.c:535: static int fsmRename(const char *opath, const char *path)
./rpm-4.12.0.1/lib/rpmarchive.h RPMERR_RENAME_FAILED = -32774,
===
Comment 1 Michel Normand 2017-01-05 08:53:29 UTC
*** Bug 1009472 has been marked as a duplicate of this bug. ***
Comment 2 Arvin Schnell 2017-01-09 10:32:52 UTC
Looks more like a kernel btrfs problem. There have been such cases in
the past, e.g. bug #950178 and bug #963020.
Comment 6 Michel Normand 2017-03-31 19:29:47 UTC
Two test cases (not multipath tests)
that previously ran with the default HDDMODEL=virtio-blk
and were temporarily forced to HDDMODEL=scsi-hd report a similar problem.
So the source of the problem is not only btrfs but also the scsi-hd device driver.
===
https://openqa.opensuse.org/tests/380110
https://openqa.opensuse.org/tests/380111
===
2017-03-31 14:41:21 <1> install(3004) [zypp++] ExternalProgram.cc(start_program):249 Executing 'rpm' '--root' '/mnt' '--dbpath' '/var/lib/rpm' '-U' '--percent' '--noglob' '--force' '--nodeps' '--' '/mnt/var/cache/zypp/packages/openSUSE-Tumbleweed-20170322-0/suse/ppc64le/perl-5.24.0-5.53.ppc64le.rpm'
2017-03-31 14:41:21 <1> install(3004) [zypp++] ExternalProgram.cc(start_program):412 pid 5357 launched
2017-03-31 14:41:22 <1> install(3004) [zypp++] ExternalProgram.cc(checkStatus):506 Pid 5357 exited with status 1
2017-03-31 14:41:22 <5> install(3004) [zypp] Exception.cc(log):137 RpmDb.cc(doInstallPackage):2043 THROW:    Subprocess failed. Error: RPM failed: error: unpacking of archive failed on file /usr/lib/perl5/5.24.0/unicore/lib/InSC/Cantilla.pl: cpio: rename
2017-03-31 14:41:22 <5> install(3004) [zypp] Exception.cc(log):137 error: perl-5.24.0-5.53.ppc64le: install failed
2017-03-31 14:41:22 <5> install(3004) [zypp] Exception.cc(log):137 
===
Comment 10 Oliver Kurz 2017-05-26 21:53:56 UTC
I see the same failure again in the same scenario, but not in every job: only about 1 in 10 runs recently. Before, it happened reproducibly in (nearly) every run.
Comment 11 Oliver Kurz 2017-05-26 21:55:30 UTC
Latest y2log shows:

```
2017-05-25 14:12:56 <1> install(3312) [zypp] RpmDb.cc(doInstallPackage):1928 RpmDb::installPackage(/mnt/var/cache/zypp/packages/openSUSE-20170524-0/suse/noarch/kbd-legacy-2.0.3-4.1.noarch.rpm,0x0000000c)
2017-05-25 14:12:56 <1> install(3312) [zypp++] ExternalProgram.cc(start_program):249 Executing 'rpm' '--root' '/mnt' '--dbpath' '/var/lib/rpm' '-U' '--percent' '--noglob' '--force' '--nodeps' '--' '/mnt/var/cache/zypp/packages/openSUSE-20170524-0/suse/noarch/kbd-legacy-2.0.3-4.1.noarch.rpm'
2017-05-25 14:12:56 <1> install(3312) [zypp++] ExternalProgram.cc(start_program):412 pid 4998 launched
2017-05-25 14:12:57 <1> install(3312) [zypp++] ExternalProgram.cc(checkStatus):506 Pid 4998 exited with status 1
2017-05-25 14:12:57 <5> install(3312) [zypp] Exception.cc(log):137 RpmDb.cc(doInstallPackage):2043 THROW:    Subprocess failed. Error: RPM failed: error: unpacking of archive failed on file /usr/share/kbd/keymaps/legacy/include/compose.latin3: cpio: rename
2017-05-25 14:12:57 <5> install(3312) [zypp] Exception.cc(log):137 error: kbd-legacy-2.0.3-4.1.noarch: install failed
2017-05-25 14:12:57 <5> install(3312) [zypp] Exception.cc(log):137 
2017-05-25 14:12:57 <5> install(3312) [zypp] Exception.cc(log):137 
2017-05-25 14:12:57 <1> install(3312) [Ruby] modules/PackageCallbacks.rb:422 DonePackage(error: 3, reason: 'Subprocess failed. Error: RPM failed: error: unpacking of archive failed on file /usr/share/kbd/keymaps/legacy/include/compose.latin3: cpio: rename
error: kbd-legacy-2.0.3-4.1.noarch: install failed
```
Comment 12 Michel Normand 2017-05-31 07:05:17 UTC
I now see the problem on Leap 42.3 as well, since the Build0071 snapshot (1).
y2log reports a similar cpio rename error, except that this time an error reason is included:
"cpio: rename failed - No space left on device"
(I did not see any error on Leap 42.3 Build0054 (0).)

By default the disk size is set to 10GB. If I retry with a 40GB disk, I still get the same error (2).

I do not know whether this new error reason helps the investigation.

(0) https://openqa.opensuse.org/tests/399191# 
    Build0054: no failure
    kernel 4.4.62-1
    disk 10GB
(1) https://openqa.opensuse.org/tests/410912#step/install_and_reboot/21
    Build0071: "cpio: rename failed - No space left on device"
    kernel 4.4.68-2
    disk 10GB
(2) https://openqa.opensuse.org/tests/411068#step/install_and_reboot/13
    "cpio: rename failed - No space left on device"
    kernel 4.4.68-2
    disk 40GB
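
To compare failures across runs, the "cpio: rename" lines can be pulled out of y2log mechanically. Below is a minimal Python sketch, assuming the message layout quoted in this report. The names `RenameFailure` and `find_rename_failures` are hypothetical, and the assumption that the " failed - <reason>" suffix ends the same log line is mine; no full log line with that suffix is quoted here.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches the rpm message relayed by libzypp in y2log, with or without
# the optional " failed - <reason>" suffix (assumed to end the line).
RENAME_FAILURE = re.compile(
    r"unpacking of archive failed on file (?P<path>.+?): "
    r"cpio: rename(?: failed - (?P<reason>.+))?$"
)


class RenameFailure(NamedTuple):
    path: str               # file that rpm could not rename into place
    reason: Optional[str]   # e.g. "No space left on device", or None


def find_rename_failures(lines: Iterable[str]) -> List[RenameFailure]:
    """Return each distinct "cpio: rename" failure, in order of appearance."""
    found: List[RenameFailure] = []
    for line in lines:
        match = RENAME_FAILURE.search(line.rstrip())
        if match is None:
            continue
        failure = RenameFailure(match.group("path"), match.group("reason"))
        # The same failure is logged twice (THROW and DonePackage lines).
        if failure not in found:
            found.append(failure)
    return found
```

Feeding it the lines of an extracted y2log gives one entry per failing file, with `reason` set to `None` for the plain "cpio: rename" variant.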
Comment 13 Michel Normand 2017-06-09 13:14:18 UTC
To complete comment #12:
six Leap 42.3 openQA tests are now failing with the same error on Build0083:
https://openqa.opensuse.org/tests/overview?groupid=30&version=42.3&build=0083&distri=opensuse

I do not have access to https://bugzilla.suse.com/show_bug.cgi?id=1039504
Could it be a similar problem?
Comment 14 Michal Suchanek 2017-06-09 15:22:34 UTC
Bug 1039504 is closed as a duplicate of bug 1040182, which is the same issue on SLE.
Comment 15 Michel Normand 2017-06-09 17:58:40 UTC
To complete comment #12 and comment #13:
if I clone a Leap 42.3 Build0083 job with clone_job and FILESYSTEM=ext4,
the resulting job does not fail.
That confirms the cpio rename error is related to the btrfs FS.
===
$/usr/share/openqa/script/clone_job.pl --from https://openqa.opensuse.org 417515 --host https://openqa.opensuse.org FILESYSTEM=ext4  --skip-download
Created job #417938: opensuse-42.3-DVD-ppc64le-Build0083-minimalx@ppc64le
===
https://openqa.opensuse.org/tests/417515# <= btrfs failure
https://openqa.opensuse.org/tests/417938# <= ext4 passed
===
Comment 16 Michel Normand 2017-06-12 07:38:34 UTC
I am changing the priority and severity because Leap 42.3 openQA tests are now failing on the ppc64le arch with the default btrfs FS, as reported in comment #12, comment #13 and comment #15.

What needs to be done to help isolate and solve this btrfs problem?
Comment 17 Michal Suchanek 2017-06-12 10:51:27 UTC
There is work underway to fix this bug.

Unfortunately the bug is not reliably reproducible inside QA and is very hard to reproduce outside QA. So finding the bug may take some time.

If you can provide a test case that reproduces the bug without running a full QA installation test that would be helpful.

Also using such test to point out a particular kernel commit that causes the bug or makes it more prominent would be helpful.
Comment 18 Oliver Kurz 2017-06-12 11:21:18 UTC
(In reply to Michal Suchanek from comment #17)
> There is work underway to fix this bug.
> 
> Unfortunately the bug is not reliably reproducible inside QA and is very
> hard to reproduce outside QA. So finding the bug may take some time.

Well, it *is* reproducible within the openQA tests and therefore what I consider "inside QA". https://openqa.opensuse.org/tests/418998 is the latest example from yesterday and the logs explicitly show that it is the same error:

```
2017-06-10 21:45:02 <5> install(3321) [zypp] Exception.cc(log):137 RpmDb.cc(doInstallPackage):2043 THROW:    Subprocess failed. Error: RPM failed: error: unpacking of archive failed on file /usr/share/fonts/100dpi/courO14-ISO8859-10.pcf.gz: cpio: rename
2017-06-10 21:45:02 <5> install(3321) [zypp] Exception.cc(log):137 error: xorg-x11-fonts-7.6-32.1.noarch: install failed
2017-06-10 21:45:02 <5> install(3321) [zypp] Exception.cc(log):137 
2017-06-10 21:45:02 <5> install(3321) [zypp] Exception.cc(log):137 
2017-06-10 21:45:02 <1> install(3321) [Ruby] modules/PackageCallbacks.rb:422 DonePackage(error: 3, reason: 'Subprocess failed. Error: RPM failed: error: unpacking of archive failed on file /usr/share/fonts/100dpi/courO14-ISO8859-10.pcf.gz: cpio: rename
error: xorg-x11-fonts-7.6-32.1.noarch: install failed
```

> If you can provide a test case that reproduces the bug without running a
> full QA installation test that would be helpful.

It might be possible to reproduce the same error by just repeatedly trying to install/uninstall a package using rpm.

Other than this, what is the problem with the "full QA installation test"? Only other alternative I have in mind right now is running a specific subset of "xfstests" but I don't know which one would be feasible.

@Michel Normand: Maybe you can try out to run xfstests in an environment similar to the one that fails here?

> Also using such test to point out a particular kernel commit that causes the
> bug or makes it more prominent would be helpful.

In case no one did that yet I recommend to check the kernel version differences between the first failed and the last good and then look into the changelog to identify submit requests and commits correspondingly.
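
As a sketch of that bracketing step: given results ordered oldest first, the helper below returns the last passing and the first failing entry, whose kernel versions can then be compared in the changelog. The function name and the tuple layout are hypothetical; the sample data are the two Leap 42.3 builds from comment #12. Since the failure is intermittent, a single passing run does not prove that a build is good.

```python
def bracket_regression(results):
    """Find the first pass-to-fail transition.

    results: list of (build, kernel, passed) tuples, oldest first.
    Returns (last_good, first_bad), or None if no failing entry
    follows a passing one.
    """
    last_good = None
    for entry in results:
        passed = entry[2]
        if passed:
            last_good = entry
        elif last_good is not None:
            return last_good, entry
    return None


# Builds and kernel versions as listed in comment #12.
leap_42_3 = [
    ("Build0054", "4.4.62-1", True),
    ("Build0071", "4.4.68-2", False),
]
bracket = bracket_regression(leap_42_3)
```

Here `bracket` holds the Build0054 and Build0071 entries, so the kernel changes to review would be those between 4.4.62-1 and 4.4.68-2.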
Comment 19 Michal Suchanek 2017-06-12 14:17:26 UTC
(In reply to Oliver Kurz from comment #18)
> (In reply to Michal Suchanek from comment #17)
> > There is work underway to fix this bug.
> > 
> > Unfortunately the bug is not reliably reproducible inside QA and is very
> > hard to reproduce outside QA. So finding the bug may take some time.
> 
> Well, it *is* reproducible within the openQA tests and therefore what I
> consider "inside QA". https://openqa.opensuse.org/tests/418998 is the latest
> example from yesterday and the logs explicitly show that it is the same
> error:
> 
> ```
> 2017-06-10 21:45:02 <5> install(3321) [zypp] Exception.cc(log):137
> RpmDb.cc(doInstallPackage):2043 THROW:    Subprocess failed. Error: RPM
> failed: error: unpacking of archive failed on file
> /usr/share/fonts/100dpi/courO14-ISO8859-10.pcf.gz: cpio: rename
> 2017-06-10 21:45:02 <5> install(3321) [zypp] Exception.cc(log):137 error:
> xorg-x11-fonts-7.6-32.1.noarch: install failed
> 2017-06-10 21:45:02 <5> install(3321) [zypp] Exception.cc(log):137 
> 2017-06-10 21:45:02 <5> install(3321) [zypp] Exception.cc(log):137 
> 2017-06-10 21:45:02 <1> install(3321) [Ruby] modules/PackageCallbacks.rb:422
> DonePackage(error: 3, reason: 'Subprocess failed. Error: RPM failed: error:
> unpacking of archive failed on file
> /usr/share/fonts/100dpi/courO14-ISO8859-10.pcf.gz: cpio: rename
> error: xorg-x11-fonts-7.6-32.1.noarch: install failed
> ```

And about half of the tests succeed for recent builds and most of them for Build20170527. That is what I call not reliably reproducible.

> 
> > If you can provide a test case that reproduces the bug without running a
> > full QA installation test that would be helpful.
> 
> It might be possible to reproduce the same error by just repeatedly trying
> to install/uninstall a package using rpm.

Yes, it *might*. But nobody reproduced it that way so far. So if you have exact steps that lead to the error with reasonable probability go ahead and share them.

> 
> Other than this, what is the problem with the "full QA installation test"?

That it happens after a lengthy process on a virtual machine somewhere in QA which is trashed after the test rather than on a developer machine where the state of the system can be analyzed after the error.

> Only other alternative I have in mind right now is running a specific subset
> of "xfstests" but I don't know which one would be feasible.

Or some tar or cpio benchmarks come to mind, yes.
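
A possible starting point for such a benchmark, as an untested sketch: it writes each file under a temporary name and then renames it into place, which is the operation that fails in fsmRename. The function name `stress_rename`, the `;tmp` suffix and the default sizes are made up for illustration, and nothing here is known to trigger the bug.

```python
import errno
import os


def stress_rename(target_dir, count=1000, size=4096):
    """Create `count` files under a temporary name and rename each into place.

    Returns a list of (final_path, errno_name) for every rename that failed;
    an empty list means all renames succeeded.
    """
    failures = []
    payload = b"\0" * size
    for i in range(count):
        final = os.path.join(target_dir, f"file{i:06d}")
        temp = final + ";tmp"  # hypothetical suffix, for illustration only
        with open(temp, "wb") as handle:
            handle.write(payload)
        try:
            os.rename(temp, final)
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            failures.append((final, name))
    return failures
```

It would have to be pointed at a directory on a freshly created btrfs file system in an environment close to the failing one (ppc64le, scsi-hd disk); any `ENOSPC` entry in the result on a mostly empty disk would match the symptom from comment #12.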
Comment 20 Michel Normand 2017-06-15 12:45:48 UTC
FYI, as a bypass I added a retry of the package installation in openQA (1);
this retry allows the Leap 42.3 ppc64le Build0089 run to complete.

(1) https://openqa.opensuse.org/tests/421918#step/install_and_reboot/3
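
The idea of the bypass can be sketched generically as below. This is not the actual openQA test code; `retry`, `attempts` and `delay` are hypothetical names, and the job at (1) shows the real retry in action.

```python
import time


def retry(action, attempts=3, delay=0.0):
    """Call `action` until it succeeds or `attempts` calls have failed.

    Returns the result of the first successful call and re-raises the
    last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:  # broad on purpose: any failure is retried
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    raise last_error
```

A retry only hides the intermittent rename failure so that the rest of the installation can be tested; it does not address the underlying btrfs problem.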
Comment 21 Michel Normand 2017-06-16 07:20:39 UTC
(In reply to Michel Normand from comment #20)
> FYIO, as a bypass I added in openQA a retry of packages install (1),
> retry that allow to complete the Leap 42.3 ppc64le Build0089.
> 
> (1) https://openqa.opensuse.org/tests/421918#step/install_and_reboot/3

The same bypass also works for the latest TW snapshot 20170615 (ppc64/ppc64le):
https://openqa.opensuse.org/tests/422452#step/install_and_reboot/3
https://openqa.opensuse.org/tests/422451#step/install_and_reboot/3
Comment 22 Michel Normand 2017-06-28 13:05:14 UTC
(In reply to Michal Suchanek from comment #19)
> And about half of the tests succeed for recent builds and most of them for
> Build20170527. That is what I call not reliably reproducible.
> ...[CUT]...

With the latest Leap 42.3 Build 0101 the failure is reproducible on retry, as shown by the two examples (1) and (2).
Some btrfs disk capacity data was captured for the similar bug #1039504 (I do not have access to that bug; could you add me to CC?), as detailed in (3).
Is that data capture sufficient and, if not, what needs to be added?

Note that (1) and (2) are clone_job runs with increased HDDSIZEGB, as per (4).

(1) https://openqa.opensuse.org/tests/433628#step/install_and_reboot/6 (DVD) 
(2) https://openqa.opensuse.org/tests/433630#step/install_and_reboot/6 (NET)
(3) https://github.com/os-autoinst/os-autoinst-distri-opensuse/commit/22add07cf40044352acf5e846e774bfb317248ba
(4) ====
$/usr/share/openqa/script/clone_job.pl --from https://openqa.opensuse.org 433351 --host https://openqa.opensuse.org HDDSIZEGB=20 BETA=1 --skip-download
Created job #433628: opensuse-42.3-DVD-ppc64le-Build0101-minimalx@ppc64le -> https://openqa.opensuse.org/t433628
===
$/usr/share/openqa/script/clone_job.pl --from https://openqa.opensuse.org 433343 --host https://openqa.opensuse.org HDDSIZEGB=20 BETA=1 --skip-download
Created job #433630: opensuse-42.3-NET-ppc64le-Build0101-minimalx@ppc64le -> https://openqa.opensuse.org/t433630
===
Comment 23 Michel Normand 2017-06-30 09:01:30 UTC
As per https://bugzilla.suse.com/show_bug.cgi?id=1040182#c129
we are waiting for a Leap 42.3 kernel rebuild that includes the related patch (1).
It is not yet in ISO Build0102, as shown by the failing openQA result (2).

(1) http://kernel.opensuse.org/cgit/kernel-source/commit/?h=openSUSE-42.3&id=8bf31dae2ad1a3c5471841801bc4f12233e3c2ec
(2) https://openqa.opensuse.org/tests/435028#step/install_and_reboot/6
Comment 28 Michal Suchanek 2017-10-26 16:32:11 UTC
This does not seem to have happened in the past month, so closing.

There were fixes that went into the btrfs kernel driver to address this.
Comment 29 Michel Normand 2017-10-26 17:40:28 UTC
OK to close, as there are no more failures in the TW openQA runs.
I will check again with the next Leap 15 when it is available.
Comment 34 openQA Review 2021-06-01 05:21:53 UTC
This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: mru-install-multipath-remote
https://openqa.suse.de/tests/6145303

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released"
3. The label in the openQA scenario is removed
Comment 35 openQA Review 2022-01-14 00:01:02 UTC
This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: create_hdd_tumbleweed_kde
https://openqa.opensuse.org/tests/2134907

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
3. The bugref in the openQA scenario is removed or replaced, e.g. `label:wontfix:boo1234`
Comment 36 openQA Review 2022-02-13 23:58:28 UTC
This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_sles15sp1_ltss_media_basesys-srv-desk-dev-contm-lgm-py2-wsm_all_full_x11
https://openqa.suse.de/tests/8150203

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
3. The bugref in the openQA scenario is removed or replaced, e.g. `label:wontfix:boo1234`