Scott Moser announced this week that new Ubuntu images which boot from an EBS-based root filesystem in EC2, and thus persist across reboots, are available for testing.
As usual with something fresh out of the oven and explicitly labeled for testing purposes, there was a minor bug in the first iteration of images, which was mentioned in the announcement itself. If not worked around as described in the announcement, the bug prevents the image from rebooting.
Having a bootable EBS image which can't reboot is quite an interesting (and ironic) problem. You have an image which persists, but suddenly you have no way to see what is inside it anymore, because you can't boot it. Naturally, even if this particular bug didn't exist, it's fairly easy to get into such a situation accidentally if you're fiddling with the image configuration.
So, in this post we’ll see how to recover from a situation where a bootable EBS image can’t boot.
Getting started
To start this up, we’ll boot one of the EBS images which Scott mentioned in his announcement: ami-8bec03e2. As we see in the output of ec2-describe-images, this is an EBS-based image for i386:
% ec2-describe-images ami-8bec03e2
IMAGE ami-8bec03e2 099720109477/ebs/ubuntu-images-testing/ubuntu-lucid-daily-i386-server-20100305 099720109477 available public i386 machine aki-3fdb3756 ebs
BLOCKDEVICEMAPPING /dev/sda1 snap-f1efd098 15
Let’s run this image. Remember to replace the value passed in the -k command line option with your own key pair name.
% ec2-run-instances -k gsg-keypair ami-8bec03e2
RESERVATION r-9e4615f6 626886203892 default
INSTANCE i-e3e33a88 ami-8bec03e2 pending gsg-keypair 0 m1.small 2010-03-09T20:04:12+0000 us-east-1c aki-3fdb3756 monitoring-disabled ebs
There we go. We got an instance allocated in the availability zone us-east-1c. It’s important to keep track of this information, since EBS volumes are zone-specific.
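If you lose track of the zone later on, it can always be recovered from the instance description itself; a quick check goes something like this (the zone shows up in the INSTANCE line, right after the launch time, and the grep filter is just a convenience):

% ec2-describe-instances i-e3e33a88 | grep ^INSTANCE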
As part of the above command, we must have been allocated an EBS volume automatically, and it should be attached to the instance we just started. We can investigate it with the ec2-describe-volumes command:
% ec2-describe-volumes
VOLUME vol-edca1684 15 snap-f1efd098 us-east-1c in-use 2010-03-09T20:04:20+0000
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attached 2010-03-09T20:04:24+0000
Now, we'll get into the running instance and make some arbitrary modifications, just as a way to demonstrate that the data we don't want to lose actually survives the recovery operation. Note that the domain name is obtained with the ec2-describe-instances command.
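For reference, the lookup goes along these lines; the public DNS name shows up in the INSTANCE line, right after the AMI ID, and the cut filter below is just a convenience which assumes the usual tab-separated output of the EC2 tools:

% ec2-describe-instances i-e3e33a88 | grep ^INSTANCE | cut -f4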
% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-184-73-51-147.compute-1.amazonaws.com
(…)
ubuntu@domU-12-31-39-0E-A0-03:~$ echo "Important data" > important-data
ubuntu@domU-12-31-39-0E-A0-03:~$ ls -l important-data
-rw-r--r-- 1 ubuntu ubuntu 15 Mar  9 20:15 important-data
ubuntu@domU-12-31-39-0E-A0-03:~$ sudo reboot
Broadcast message from ubuntu@domU-12-31-39-0E-A0-03
(/dev/pts/0) at 20:18 …
The system is going down for reboot NOW!
Note that we didn't actually fix the problem reported by Scott, so our machine won't really come back up. If we wait a while, we can even see that the problem is exactly what was reported in the announcement (it takes a bit for the console output to be synced up):
% ec2-get-console-output i-e3e33a88 | tail -4
mount: special device ephemeral0 does not exist
mountall: mount /mnt [294] terminated with status 32
mountall: Filesystem could not be mounted: /mnt
Alright, now what? The machine is dead and can't reboot. How do we get to our important data?
Fixing the problem
The first thing we do is to stop the instance. Do not terminate it, or you'll lose the EBS volume! After stopping it, we'll detach the EBS volume that was being used as the root filesystem, so that we can attach it somewhere else.
% ec2-stop-instances i-e3e33a88
INSTANCE i-e3e33a88 running stopping
% ec2-detach-volume vol-edca1684
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 detaching 2010-03-09T20:04:22+0000
Now, we need to attach this volume to an instance which actually boots, so that we can fix it. For this experiment, we'll pick one of the daily Lucid images, but we could really use any other working image. Just remember that the instance must be running in the same availability zone as our previous one, since the EBS volume won't be accessible otherwise.
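If you don't have a suitable AMI ID at hand, one way to find candidates is to list the images published by the same account that owns the EBS image we started from (099720109477, visible in the ec2-describe-images output above), and filter for the desired release and architecture; something along these lines should do:

% ec2-describe-images -o 099720109477 | grep lucid | grep i386

With an image chosen, we start the temporary instance, forcing it into the same availability zone as the volume: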
% ec2-run-instances -k gsg-keypair -z us-east-1c ami-b5f619dc
RESERVATION r-967427fe 626886203892 default
INSTANCE i-fd08d196 ami-b5f619dc pending gsg-keypair 0 m1.small 2010-03-09T21:10:11+0000 us-east-1c aki-3fdb3756 monitoring-disabled instance-store
% ec2-attach-volume vol-edca1684 -i i-fd08d196 -d /dev/sdh1
ATTACHMENT vol-edca1684 i-fd08d196 /dev/sdh1 attaching 2010-03-09T21:10:51+0000
With the instance running and the EBS root device attached under an alternative device name, we can log in and fix the original problem which prevented the image from booting correctly. In our case, we'll simply do what Scott suggested in the announcement.
% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-204-236-194-196.compute-1.amazonaws.com
(…)
$ mkdir ebs-root
$ sudo mount /dev/sdh1 ebs-root
$ sudo sed -i 's/^ephemeral0/#ephemeral0/' ebs-root/etc/fstab
$ sudo umount ebs-root
$ logout
Connection to ec2-204-236-194-196.compute-1.amazonaws.com closed.
Done! Our EBS volume is now fixed, and the instance should boot alright. We'll detach the volume from the temporary instance we created, and reattach it to the old bootable EBS instance which is stopped. Note that we won't terminate the temporary instance just yet, because we may need it in case something else is still wrong, and we're already paying for the hour anyway. We just have to remember to terminate it once we're fully done.
% ec2-detach-volume vol-edca1684
ATTACHMENT vol-edca1684 i-fd08d196 /dev/sdh1 detaching 2010-03-09T21:10:51+0000
% ec2-attach-volume vol-edca1684 -i i-e3e33a88 -d /dev/sda1
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attaching 2010-03-09T21:24:55+0000
% ec2-describe-volumes vol-edca1684
VOLUME vol-edca1684 15 snap-f1efd098 us-east-1c in-use 2010-03-09T20:04:20+0000
ATTACHMENT vol-edca1684 i-e3e33a88 /dev/sda1 attached 2010-03-09T21:24:55+0000
Okay! It should all be good now. It's time to restart our instance and see if it is working. Note that since we stopped and started the instance, the public domain name has most likely changed, so we need to find it out again with ec2-describe-instances once the instance is running.
% ec2-start-instances i-e3e33a88
INSTANCE i-e3e33a88 stopped pending
% ec2-describe-instances i-e3e33a88
RESERVATION r-9e4615f6 626886203892 default
INSTANCE i-e3e33a88 ami-8bec03e2 ec2-184-73-72-214.compute-1.amazonaws.com domU-12-31-39-03-B8-21.compute-1.internal running gsg-keypair 0 m1.small 2010-03-09T21:28:43+0000 us-east-1c aki-3fdb3756 monitoring-disabled 184.73.72.214 10.249.187.207 ebs
BLOCKDEVICE /dev/sda1 vol-edca1684 2010-03-09T21:24:55.000Z
% ssh -i ~/.ssh/id_dsa_gsg-keypair ubuntu@ec2-184-73-72-214.compute-1.amazonaws.com
(…)
$ cat important-data
Important data
$ logout
Connection to ec2-184-73-72-214.compute-1.amazonaws.com closed.
It worked, and our important data is still there!
Don’t forget to kill the temporary instance you’ve used to fix it after you’re comfortable with the result:
% ec2-terminate-instances i-fd08d196
INSTANCE i-fd08d196 running shutting-down
Conclusion
In this post we have seen how to fix a bootable EBS machine which can't actually boot. The technique consists of detaching the volume from the stopped instance, attaching it to a temporary instance, fixing the image, and then reattaching the volume to the original instance. This back and forth of EBS volumes is quite useful in many circumstances, so keep it in your tool belt. A short recap of the commands involved follows below.
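As a quick cheat sheet, the whole round trip boils down to the following, where the instance, volume, and image identifiers are placeholders to be replaced by your own:

% ec2-stop-instances i-BROKEN
% ec2-detach-volume vol-ROOT
% ec2-run-instances -k my-keypair -z ZONE ami-HELPER
% ec2-attach-volume vol-ROOT -i i-HELPER -d /dev/sdh1
  (ssh into the helper instance, mount /dev/sdh1, fix the problem, umount)
% ec2-detach-volume vol-ROOT
% ec2-attach-volume vol-ROOT -i i-BROKEN -d /dev/sda1
% ec2-start-instances i-BROKEN
% ec2-terminate-instances i-HELPER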