who can you trust your data with? WD? SG? Even Windows is no good.

It's gotten to the point that I don't know who I can trust with my data.  A while back I encountered the famous Seagate firmware bug in their Barracuda disk, but managed to rescue it.  Then recently I put together a new computer and threw in 2 brand-new 3TB drives, one Western Digital Caviar Green and the other a Seagate Barracuda 3TB (wait… didn't I just have a problem with a Barracuda?).  Both disks were on sale, so I thought what the heck, I'll just use these two and be set for a while.

Yeah, I wish.

After about 3 months, both drives failed almost simultaneously!  I started to hear some scratchy noises, and then all of a sudden the very last partition on each disk showed up as RAW!  (I had partitioned each disk into 5 500GB partitions plus one final partition with whatever space remained.)  Paranoid, I ran chkdsk to see if the disks were okay, but to my surprise, chkdsk showed that both disks had hundreds of GB of bad sectors!  Imagine that!  Even if I opened up a disk and used a pencil to draw rings on the platter surface, it would take me a while to ruin hundreds of GB worth of sectors, but there they were.

So I used the advance replacement program from both brands, planning to copy whatever data was left on the old drives to the new ones, then DBAN the old ones and send them back to the companies.  Simple as that, aside from grieving for the lost data.

Again, I wish.

Today I received the WD 3TB replacement disk.  I stuck it into the Windows 7 64-bit machine, but the Computer Management Console would only recognize it as having some 746GB, even though the BIOS shows the disk as 3TB.  I googled around a bit, and it seems many people have suffered the same problem; some of the suggested solutions included updating the driver and installing Intel Rapid Storage Technology.  I tried to update the driver, but Windows said the best driver was already installed, and my box is an AMD box, so Intel Rapid Storage Technology doesn't apply!  I even tried ASRock's 3TB+ unlock utility, but still to no avail.  Of course, there are other "solutions" such as getting a PCI/PCI-e disk controller card, but my PCI slots are already occupied.

So how the heck was the previous (and failed) 3TB disk recognized as 3TB just fine, while this one would not work?  Then it occurred to me: the previous disk I had stuck into a Macally SATA enclosure and partitioned and formatted on a, get this, 32-bit Windows Vista box!  So I took the new disk out, stuck it into the enclosure, and booted up my 32-bit Vista box.  Lo and behold, it shows the disk as having 2794.39GB!  (Why not 3TB, you ask?  Because when the manufacturer says TB, it means 1000 x 1000 x 1000 x 1000, or 10 to the 12th power, bytes, rather than 1024 x 1024 x 1024 x 1024, so if you divide the 3,000,000,000,000 bytes that the manufacturer calls 3TB by 1024 x 1024 x 1024, you get something close to 2793.96GB.)
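
The arithmetic is easy to verify with a one-liner, e.g. using awk:

```shell
# "3 TB" on the box is 3 * 10^12 bytes; dividing by 1024^3 gives the
# size in binary gigabytes, roughly what the OS reports.
awk 'BEGIN { printf "%.2f\n", 3000000000000 / (1024 * 1024 * 1024) }'
# prints 2793.97
```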

Well, I am not claiming that 32-bit Windows Vista is able to recognize the 3TB disk.  In fact, I think all the credit should go to the wonderful Macally SATA enclosure.  I could have tried the enclosure on the Windows 7 64-bit machine, but I didn't, because formatting the disk takes a long time and I really needed to scrape the data off the failed drive.

But here is the more interesting twist: if my memory serves me right, while the Caviar Green 3TB drive was partitioned using the enclosure (because at that time I had not yet received the new AMD box), the Seagate Barracuda 3TB was actually partitioned and formatted inside the box itself.  So I am not sure what the deal was: whether Windows was indeed having trouble with the WD 3TB but worked fine with the SG 3TB, or whether the WD 3TB also being in the box somehow influenced how Windows recognized the other 3TB disks.  With the Seagate replacement still on the way, I cannot test this right now.

But one thing is for certain: I trust neither WD nor SG disks any more.  And based on my experiences, Hitachi disks are not that great either.  So I am not sure which disks I can trust now, since SSDs are still very low in capacity.  Maybe I should just buy a bunch of pencils and paper and start writing down the bits.



How to clear OS cache

Modern computers are equipped with lots of memory, and operating systems utilize the free memory to cache things for faster access, such as inodes and files on disk.  This is great for day-to-day use because the caching makes things faster.  It is not so great, however, if you are an "experimental computer scientist" who often carries out serious performance tests: the OS cache gets in the way and messes up your timing information.  Many of you bite the bullet by rebooting the machine each time (that is, if you have the privilege to do so), which makes your performance tests take much longer to finish.

Under Linux, you don't need to reboot the machine.  You can use the following command chain to clear the OS cache (though you still need sudo access):

> sudo su
> sync; echo 3 > /proc/sys/vm/drop_caches

sync makes sure all dirty buffers are flushed. Writing 3 to /proc/sys/vm/drop_caches clears everything: the page cache, directory entries (dentries), and inodes. You may also choose to clear only the page cache using "echo 1", or only dentries and inodes using "echo 2".
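
If you prefer not to drop into a root shell, the same thing can be written as a one-liner.  Note that a plain `sudo echo 3 > /proc/sys/vm/drop_caches` would NOT work, because the redirect runs in your unprivileged shell; wrapping the whole write in sh -c fixes that:

```shell
# Flush dirty buffers first, then drop all caches.  The redirect runs
# inside the sudo'd shell, which has permission to write to /proc.
sync
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
```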

hbase not-so-quick start

I wanted to play with Apache HBase, so I downloaded v0.94.2 to an Ubuntu VirtualBox VM and followed the quick start.  But it didn't start at all.  The log file had exceptions similar to the following:

2012-11-15 09:37:28,728 INFO org.apache.hadoop.ipc.HBaseRPC: Server at localhost/ could not be reached after 1 tries, giving up.
 2012-11-15 09:37:28,732 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of -ROOT-,,0.70236052 to localhost,40408,1352990244709, trying to assign elsewhere instead; retry=0
 org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy interface org.apache.hadoop.hbase.ipc.HRegionInterface to localhost/ after attempts=1
 at org.apache.hadoop.hbase.ipc.HBaseRPC.handleConnectionException(HBaseRPC.java:291)
 at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:259)
 at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1313)
 at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1269)
 at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1256)
 at org.apache.hadoop.hbase.master.ServerManager.getServerConnection(ServerManager.java:550)
 at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:483)
 at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1640)
 at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1363)
 at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1338)
 at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1333)
 at org.apache.hadoop.hbase.master.AssignmentManager.assignRoot(AssignmentManager.java:2212)
 at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:632)
 at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:529)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:344)
 at org.apache.hadoop.hbase.master.HMasterCommandLine$LocalHMaster.run(HMasterCommandLine.java:220)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:692)
 at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
 at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:416)
 at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:462)
 at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1150)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1000)
 at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
 at $Proxy12.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:183)
 at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:335)
 at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:312)
 at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:364)
 at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:236)
 ... 15 more
 2012-11-15 09:37:28,732 WARN org.apache.hadoop.hbase.master.AssignmentManager: Unable to find a viable location to assign region -ROOT-,,0.70236052

After trying a few different things, I found this blog post.  The post was about Cloudera releases, but the symptoms were very similar to what I experienced.  It presented two approaches to solving the issue: either commenting out the " COMPNAME" line in /etc/hosts, or adding the property "-Djava.net.preferIPv4Stack=true" to HADOOP_OPTS in hadoop-env.sh.  I tried the first approach by commenting out all the COMPNAME lines in /etc/hosts.  In my case, there were 2 such lines, one resolving to localhost and the other to evabuntu (the name I gave to my virtual machine).

When I tried to start HBase again, I got a different exception:

2012-11-15 10:06:22,062 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMasterCommandLine$LocalHMasterevabuntu
at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:134)
at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:197)
at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:147)
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:140)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:103)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1806)
Caused by: java.net.UnknownHostException: evabuntu: evabuntu
at java.net.InetAddress.getLocalHost(InetAddress.java:1438)
at org.apache.hadoop.net.DNS.getDefaultHost(DNS.java:185)
at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:241)
at org.apache.hadoop.hbase.master.HMasterCommandLine$LocalHMaster.<init>(HMasterCommandLine.java:215)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:131)
... 7 more
Caused by: java.net.UnknownHostException: evabuntu
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
at java.net.InetAddress.getLocalHost(InetAddress.java:1434)
... 15 more

This time, the exception was much easier to understand.  HBase picked up the name "evabuntu" from /etc/hostname but wasn't able to resolve it to an address, because there was no entry for it in /etc/hosts.  So all I had to do was add an IPv6 entry for it in /etc/hosts.

First, use ifconfig to find the IPv6 address:

> /sbin/ifconfig

eth0      Link encap:Ethernet  HWaddr 08:00:27:f4:8f:83
inet addr:  Bcast:  Mask:
inet6 addr: fe80::a00:27ff:fef4:8f83/64 Scope:Link
RX packets:66784 errors:0 dropped:0 overruns:0 frame:0
TX packets:38494 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:86804874 (86.8 MB)  TX bytes:2845363 (2.8 MB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:df:4d:e0
inet addr:  Bcast:  Mask:
inet6 addr: fe80::a00:27ff:fedf:4de0/64 Scope:Link
RX packets:93 errors:0 dropped:0 overruns:0 frame:0
TX packets:92 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:11166 (11.1 KB)  TX bytes:12582 (12.5 KB)

lo        Link encap:Local Loopback
inet addr:  Mask:
inet6 addr: ::1/128 Scope:Host
RX packets:9652 errors:0 dropped:0 overruns:0 frame:0
TX packets:9652 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:971333 (971.3 KB)  TX bytes:971333 (971.3 KB)

In my case, I wanted to use eth1, but you may pick the IPv6 address from any interface.  So I copied and pasted the inet6 address fe80::a00:27ff:fedf:4de0 into /etc/hosts:

> cat /etc/hosts

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
fe80::a00:27ff:fedf:4de0 evabuntu

and HBase started.

Then I tried the second approach outlined in that blog post, which was to force Java to use IPv4.  The example in the post added the property to HADOOP_OPTS in conf/hadoop-env.sh, but for Apache HBase it needs to be added to HBASE_OPTS in conf/hbase-env.sh.  This alone wasn't enough, though, because as I mentioned earlier, /etc/hostname says evabuntu, so I also needed to add an entry for evabuntu in /etc/hosts.  It is the same drill as above, except this time you pick out the IPv4 address assigned to the network interface instead of the IPv6 one.
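
For reference, the line added to conf/hbase-env.sh would look something like this (the export form is my own sketch; only the -Djava.net.preferIPv4Stack=true property comes from the blog post):

```shell
# In conf/hbase-env.sh: append the flag so any existing options survive.
export HBASE_OPTS="$HBASE_OPTS -Djava.net.preferIPv4Stack=true"
```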


case note: resurrecting a bricked Seagate Barracuda 7200.11 disk

A few weeks back, my 1TB Seagate Barracuda 7200.11 disk bricked suddenly.  It happened after a reboot: the computer got stuck on POST trying to detect the disk, and the disk activity LED stayed on steadily, but the disk was just not recognized.  With the SATA power cable connected I could feel the disk spinning inside, which told me it was not a mechanical problem, but the computer would not recognize it no matter what I did (including putting it into an enclosure and plugging it into another computer).  It looked very disturbing at first, but after some research it turned out to be a common problem caused by a bug in the firmware, and there is a fix for it.  The bug manifests in 2 symptoms: the LBA=0 problem and the BSY problem.  In the former, the disk is recognized by the computer but shows a capacity of 0; in the latter, the computer doesn't detect the disk at all, which is the case I encountered.

The general steps for fixing the BSY problem are roughly the following:

1. Rig a cable, typically with USB on one end to plug into a computer, and 3 wires on the other end to plug into jumper pins (TX, RX, and GND) on the disk.

2. Loosen the PCB (Printed Circuit Board) of the disk with a Torx 6 screwdriver.

3. Insert a non-conductive layer between the PCB and the chip on the disk so they no longer make contact.

4. Power on the disk.

5. Use a terminal program on the computer to send a couple of low-level commands to the disk to spin it down.

6. While the power is still on, remove the non-conductive layer so the PCB and the chip make contact.

7. Use the terminal program to send a few more commands to the disk to spin it up, and erase the S.M.A.R.T. data.

8. Power cycle the disk.

9. Use the terminal program to re-create the partition data.

HDD-Parts sells a repair kit for $49.99, but I was hoping for a more affordable method.  There are several YouTube videos showing how it can be done.  Quite a few of them suggest modifying a Nokia CA-42 cable by cutting off the phone-connector end and crimping 3 RS232 connector pins onto the wires.  I spent $13 for the cable, $9 for a crimper, and $5 for an RS232 DB9 female connector.  While it looked easy in the videos, it didn't work too well for me.  One problem was that all these instructions used HyperTerminal as the terminal program, but Microsoft stopped bundling it with Windows Vista and later.  This was not a big deal because PuTTY works equally well.  Another problem was that the RS232 DB9 pins I got were way too big to fit snugly on the disk's jumper pins, and they easily fell off.  What I really needed were "jumper headers", as I found out later, but the local Radio Shack was disappointing (the guy who worked there swore they didn't carry any DB9 crimpers until I grabbed one off the shelf and asked him what it was).  The bigger problem, though, was that Windows (Vista and 7) kept using the Prolific USB-to-Serial driver for the cable.  The name of the driver sounded convincing, but I just could not get the communication going.  Worse, the videos also instructed viewers to crack open the USB connector so they could tell which wire is which (TX, RX, GND), but the wires were so thin that after a few moves they broke off from the solder, and I had to toss the cable into the trash since I didn't feel like buying a soldering set.

Later, someone told me to check out the driver on the mini-CD that came with the Nokia CA-42 cable.  Honestly, I didn't even notice the mini-CD until the cable was in the trash can, so I couldn't verify whether that driver worked any better.

Luckily, googling a bit more pointed me to the MSFN forum thread on how to fix the problem, and since it is a forum, it is interactive, meaning there are people who can help when you run into weird situations (many, many thanks to jaclaz!!).  The forum also pointed me to a fix kit on eBay for $19.99.  The item listing on eBay recommends using the VCP (Virtual COM Port) driver for the kit, but the recommendation was buried under a ton of pictures and I literally missed it until jaclaz pointed it out to me (thanks again!!).

The instructions on the forum are great except for one part.  Instead of steps 3 ~ 6 listed above, they suggest that readers practice removing and re-attaching the PCB while the power is on.  This runs a high risk of short-circuiting the PCB, because it is very easy to drop those tiny metal screws, and if one falls on the wrong spot, the PCB becomes an FCB, a fried circuit board.  So DON'T do that; take steps 3 ~ 6 instead, which is much safer.  The instructions also use HyperTerminal as the example terminal program, but I will show you how to use PuTTY instead.

Disconnect the bricked disk from any power source.  Use the Torx 6 screwdriver to remove the PCB from the disk.  You will see a small chip on the disk.  Cut a strip from an anti-static bag or, as the videos suggest, from some plastic card (don't use any paper-based card, because it can tear easily when you try to pull it out later), cover the chip with it, and leave enough of a lead hanging off the right side of the disk for you to grab and pull later.  Re-attach the PCB, but don't tighten the screws too much, especially not on the right side where the non-conductive strip is jammed in between, so that you can still pull the strip out.  You will have 1 screw left over, because it would go into the middle of the chip and the chip is now covered; don't lose that screw.

The repair kit was easy to use: I plugged the USB connector into my computer and the 3 wires onto the disk's jumper pins (make sure you connect GND to GND, TX to RX, and RX to TX).  Windows had trouble finding the driver on its own, but I only needed to point it to the VCP driver I downloaded.  Once the driver is properly installed, log into Windows as an administrative user, go to Control Panel, then Device Manager, and locate the serial port device.  Right-click on it, go into Properties, and change its baud rate from the default 9600 to 38400.  Also note the COM port it is using; on mine, it was COM4.

I found it easiest to do this with a desktop computer; an eSATA cable from a laptop doesn't seem to provide enough current to even power on the disk.  Remove the side panel from the tower case so you have direct access to a SATA power cable.  Plug the SATA power into the bricked disk.  You should be able to feel the disk spinning by slightly lifting it with your hands: there is a certain vibration, and if you try to turn the disk you can also feel a drag from the gyroscopic resistance.

Run PuTTY.  On the configuration dialog box, make sure to select the radio button that says "Serial" [1].  Enter the COM port number noted earlier in "Serial line" [2], and 38400 in "Speed" [3].  I highly recommend saving this session by entering a meaningful name in "Saved Sessions" [4] and clicking the "Save" button [5].  After you have done all that, click on "Serial" under "Connection" in the Category tree on the left side [6].  It shows some options for the serial connection.  Change "Flow control" to "None" [7], then click on "Session" on the left side again [8] and click the "Save" button one more time to save the session for later use [5].

Click the "Open" button to open the connection.  You will see a blank window.  If everything was done correctly, pressing Ctrl-Z will show you the prompt:

F3 T>

Type the command below, followed by <Enter>, to go to level 2:

F3 T>/2 <Enter>

And your prompt should now change to

F3 2>

Now you need to spin down the disk by typing the command below, but wait for several seconds before hitting <Enter>:

F3 2>Z <wait for several seconds before hitting Enter>

And you should see

F3 2>Z <wait for several seconds before hitting Enter>

Spin Down Complete
Elapsed Time 0.135 msecs <the time may vary here>
F3 2>

People (including me) have reported that if you hit <Enter> too soon after the Z command, you may see some error codes such as:

F3 2>Z <Enter immediately after typing Z>

 LED:000000CE FAddr:00280569
 LED:000000CE FAddr:00280569

One guy even reported that it is enough to just type Z without hitting <Enter> at all; he simply backspaced and erased the Z after feeling the disk spin down. I didn't try that.

If you successfully spun down the disk, you are ready for the most important part: keep the power to the disk on, and pull out the non-conductive strip you sandwiched between the PCB and the chip earlier. Then tighten all the screws that are already in (so you still have the 1 screw lying around; don't worry about that screw for now). Be careful at this step, because you don't want to accidentally skid the tip of your screwdriver across the PCB and fry it. Tightening the screws ensures the PCB can provide enough current to the disk motor so it will spin up correctly.

Use the following command to spin up the disk:

F3 2>U <Enter>

and if everything went correctly, you should see something like:

F3 2>U <Enter>

Spin Up Complete
 Elapsed Time 7.093 secs <the time may vary here>
 F3 2>

I was lucky enough to encounter a problem at this stage: because I hadn't tightened the screws on my first attempt, the motor wasn't able to draw enough current.  It gave me the following output:

F3 2>U <Enter>

 Error 1009 DETSEC 00006008
 Spin Error
 Elapsed Time 31.324 secs
 R/W Status 2 R/W Error 84150180

If you encounter that, make sure you have tightened the screws and try again.

Once the disk has spun up successfully, change to level 1 using the following command:

F3 2>/1 <Enter>

and your prompt should change to

F3 1>

Now reset the S.M.A.R.T. data using the following command:

F3 1>N1 <Enter>

If everything was done correctly, it outputs nothing and just shows you another prompt.  However, because of the loose screws, the motor hadn't spun up correctly on my first attempt, yet I failed to notice the error messages from the U command.  So when I continued on with the N1 command, I got the following output:

F3 1>N1 <Enter>

Unable to load Diag Overlay

If you see this, STOP.  Power off the disk, re-sandwich the non-conductive strip, and START FROM THE BEGINNING.  I was bold enough to go on even after seeing the error message, and I will tell you what happened in just a bit.

If erasing the S.M.A.R.T. data was successful, power off the disk as well.  Wait for the disk to completely stop spinning (several seconds), and power it back on.  Reconnect a terminal session to the disk and press Ctrl-Z.  Now run the last command to re-create the partition data (note there are 5 commas between the second "2" and the last "22"):

F3 T>m0,2,2,,,,,22 (enter)

This command takes a while to execute. If everything went right, you will eventually see some output like the following:

Max Wr Retries = 00, Max Rd Retries = 00, Max ECC T-Level = 14, Max Certify Rewrite Retries = 00C8

User Partition Format 5% complete, Zone 00, Pass 00, LBA 00004339, ErrCode 00000080, Elapsed Time 0 mins 05 secs

User Partition Format Successful - Elapsed Time 0 mins 05 secs

Now you have your disk back. Power off the disk and disconnect the COM cable. Put the last screw back in, make sure all screws are tightened, copy all the data from this disk to another disk, and apply the latest firmware from Seagate. (Perform the firmware update as the very last step: updating firmware is risky too, so you only want to do it after you have copied the data to another disk.)

Because I wasn't paying attention during my first attempt, I tried to re-create the partition data even though the motor spin-up wasn't successful, and the disk started to give out a horrible "click click click" noise; I thought my disk was doomed for sure. It turned out to be okay, but I wouldn't recommend taking such risks.

In any case, if a step fails, you will more than likely need to power off the disk and start from the beginning. If you are uncertain about something, don't rush in head-first. Go to MSFN and ask jaclaz and all the good folks there first. We want you to be a happy bunny in the basket.

Looking back, I debated with myself whether I should have just bought the $49.99 repair kit from HDD-parts.com, since I ended up paying something close to that price anyway ($13 + $9 + $5 + $19.99).  But I decided it was a good thing I didn't, because the $49.99 repair kit is only half the story: I may not have found the MSFN forum had it not been for the failed attempt with the CA-42 cable, and I would still have run into the problems I encountered later, with no one to help me in those situations; I might have mistakenly concluded the disk was not rescuable and given up on it.  Again, many, many thanks to jaclaz and the other folks!!

case notes: migrating our SVN to SourceForge

Recently I endeavored to migrate all modules of a particular project from our SVN to SourceForge.  These are just some notes documenting what I went through along the way, so they may or may not apply to your situation.

Some highlights of our situation:

1. Our SVN is a mess to start with.  The people committing to our SVN have varying degrees of experience with development tools, and they typically just blindly follow whatever online tutorials they can find.  So most modules don't follow the standard practice of trunk/, tags/, and branches/, and are checked in to the wrong places.

2. Our SVN is a lab-wide SVN containing code modules from all projects.  For this particular project, our protocol is to check in all code modules under a directory bearing the project name, e.g. COOLPROJECT (the placeholder name I will use below).


3. Our project space on SourceForge is dedicated to this particular project, so after they are migrated over, the modules no longer need to be under the directory bearing the project name.

4.  Our project space on SourceForge was created as a beta (aka Allura) project.  Finding documentation for Allura projects isn't terribly easy; the documentation Google finds is usually for the classic projects.  My initial plan was to migrate one module at a time, so each developer could move their stuff over whenever they were ready to commit.  But it didn't work out that way.  After speaking with Chris Tsai from SourceForge (thanks Chris!), I realized the migration must be done for all modules in a relatively short period of time, and it requires 2 steps.  The first step is to migrate from our SVN to an intermediate SourceForge "classic" SVN.  This can be done in an incremental manner (i.e. module by module); however, this classic SVN is read-only via HTTP, and therefore not usable directly.  Hence the second step, which is to import the classic SVN into Allura.  The second step must wait until the intermediate "classic" SVN is completely ready, because it wipes out the existing contents of Allura, and the Allura SVN doesn't allow users to import modules afterwards.

So to get started, the first thing was to tidy up our original SVN.  This mostly involved creating proper directory structures, moving things around to the correct places, and renaming some modules with more appropriate names.  Along the way, I also zapped a few directories, including target/ and build/ directories that shouldn't have been checked in in the first place.  I could have used the "svn move", "svn rename", and "svn delete" commands, but instead I took an easier route and used the Subclipse plugin for Eclipse, performing the tidy-ups from the GUI.

The second step is to do a dump of our SVN.  Doing a full dump (svnadmin dump /path/to/reporoot > fullsvn.dump) is always the safest, but the resulting dump file can be huge.  For example, a full dump of our SVN would be 17GB, and all I need are the things pertaining to this project.  So a better option is to dump only the revisions pertaining to this project's directory, i.e.

svnadmin dump /path/to/reporoot -r 759:1848 > partialsvn.dump

where 759 is the revision at which the project-specific directory was created, and 1848 is the latest revision.  You can find the revision numbers with:

svn log svn+ssh://server/path/to/reporoot/COOLPROJECT

and look at the first and the last entries.
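
If the log is long, a small pipeline can pull out just those two numbers.  This is my own sketch, relying on svn log printing the newest revision first, so the first number printed is the latest revision and the second is the earliest:

```shell
# -q restricts the output to the "r<NUM> | author | date" separator lines;
# awk strips the leading "r", and sed keeps the first and last numbers.
svn log -q svn+ssh://server/path/to/reporoot/COOLPROJECT \
  | awk '/^r[0-9]+ /{ sub(/^r/, ""); print $1 }' \
  | sed -n '1p;$p'
```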

However, when I initially did this, I relied on the Subclipse plugin to view the revision history of the directory, but failed to realize that, by default, it only shows the most recent 25 revisions (there are buttons for loading the next 25, as well as for loading all, but I didn't know that then).  So I used a much-too-high revision number as the lower bound, and the dump process generated a bunch of warning messages complaining that some revisions reference an earlier revision that is not included in the dump:

* Dumped revision 1788.
WARNING: Referencing data in revision 1696, which is older than the oldest
WARNING: dumped revision (1725). Loading this dump into an empty repository
WARNING: will fail.
WARNING: Referencing data in revision 1720, which is older than the oldest
WARNING: dumped revision (1725). Loading this dump into an empty repository
WARNING: will fail.
WARNING: Referencing data in revision 1721, which is older than the oldest
WARNING: dumped revision (1725). Loading this dump into an empty repository
WARNING: will fail.
* Dumped revision 1789.
* Dumped revision 1790.
* Dumped revision 1791.

All I had to do was redo the dump using the earliest revision, so that all referenced revisions were included.

The third step is to use svndumpfilter to pick out the module(s) I need into separate dump files.  I could have done this by including all modules in one command, but I preferred creating one dump file per module, since I needed some additional manipulation afterwards.  Having one dump file per module also saves a lot of pain when something goes wrong during the load.

Here it pays to have some intimate knowledge of the modules, especially after things got moved around quite a bit.  Say there is a module that used to be named "module1", and I decided to rename it to "higgs-boson-capture" as part of the tidy-up process.  I used the following command to separate this module into its own dump:

cat partialsvn.dump | svndumpfilter --drop-empty-revs include COOLPROJECT/module1 COOLPROJECT/higgs-boson-capture > higgsbosoncap.dump

If I didn't know about the renaming of the module and only included its latest name, then the COOLPROJECT/module1 history would not be included in the dump, and when I tried to load it into the SourceForge SVN, I would get file-not-found errors whenever a revision references the module's previous name, COOLPROJECT/module1.

The option “--drop-empty-revs” seemed like a good idea at the time, because it produces a leaner dump file.  But it later turned out to be a bad idea, and I will get to that in just a moment.

Recall that earlier, in bullet points 2 and 3, I mentioned how the paths should be changed after the modules are migrated to SourceForge.  So instead of COOLPROJECT/higgs-boson-capture, it would simply be higgs-boson-capture.  If I just loaded the dump file as is, it would still be COOLPROJECT/higgs-boson-capture.  Changing the path boils down to modifying two kinds of lines in the dump file: Node-path: and Node-copyfrom-path:.

A dump file can be huge, and also can be a mixture of text and binary contents; manually editing these lines would be next to impossible, not to mention most text editors can’t even handle large files.  Folks who are knowledgeable about “awk” and “sed” can probably fix this with relative ease, but I unfortunately don’t know anything about them, so instead, I created a utility in Java for replacing the node paths.  The code is available at http://sourceforge.net/p/azzurri/code-0/6/tree/nodepathreplace/.  It may not be the most efficient code, but gets the job done.

For our particular case, I needed to run the tool twice: one pass to change COOLPROJECT/module1 to module1, and another pass to change COOLPROJECT/higgs-boson-capture to higgs-boson-capture.  The order doesn’t matter, but you need to make sure all of the module’s previous path names are changed.
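For readers who do know sed, the same replacement can be sketched as a one-liner.  The sample dump content and file names below are invented for illustration, and on a real dump you should double-check the result, since in theory a line of binary content could also start with one of these headers:

```shell
# A miniature stand-in for a real dump file (real ones are huge and partly
# binary); only the Node-path: and Node-copyfrom-path: headers matter here.
printf 'Node-path: COOLPROJECT/module1/a.txt\nNode-copyfrom-path: COOLPROJECT/higgs-boson-capture/b.txt\nunrelated content line\n' > sample.dump

# Strip the COOLPROJECT/ prefix from the two header kinds only.
sed -e 's|^Node-path: COOLPROJECT/|Node-path: |' \
    -e 's|^Node-copyfrom-path: COOLPROJECT/|Node-copyfrom-path: |' \
    sample.dump > sample-fixed.dump

cat sample-fixed.dump
```

On the real file, the same two sed expressions applied to higgsbosoncap.dump produce the rewritten dump in one pass.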

Then it is just a matter of following the instructions to create a shell session on SourceForge (http://sourceforge.net/apps/trac/sourceforge/wiki/Shell%20service) and to load the dump into the SVN (http://sourceforge.net/apps/trac/sourceforge/wiki/Subversion%20import%20instructions).  Of course, you need to SCP the dump file to SourceForge before you can load it — the documentation I found about SourceForge SCP is for uploading files for release (going through the FRS, or File Release System); for our purpose, we need to SCP the dump file to the current shell session:

# do the SCP from your local server and push the file to sourceforge
scp higgsbosoncap.dump higgs,coolproject@shell.sourceforge.net:/home/users/h/hi/higgs/

Change your username and project name accordingly, and also pay attention to the directory names after /home/users/.

I did this for a couple of modules and all loaded fine, but then one module gave me an error saying “File not found” when I tried to load it.  I was 100% sure all its past names were included when I did the svndumpfilter, yet it complained a file was not found.  The dump file was small enough to be inspected with a text editor, and I noticed the offending revision was referencing a file path from a revision that did not exist in the file.  So I went back and re-did the svndumpfilter, but this time without the “--drop-empty-revs” option.  That indeed solved the problem.

This was a good exercise, but not a fun one as I spent all day working on this on the 4th of July (and the next day), instead of seeing parades and fireworks.

Reset mysql root password

If you have forgotten the root password to a mysql database, it is easy to reset as long as you have permission to kill the existing running mysql instance and restart it directly.

So the first thing is to find the running mysql.  There are multiple ways the instance could have been started (e.g. launched automatically from /etc/init.d, or launched manually by someone keying in mysqld_safe --user=mysql), and it may or may not have a .pid file — and even if it has a .pid file, it may be named differently or placed in a different location.  Looking for it in the process list is the simplest way.

ps -ef |grep -F "mysql"

Use the command above to find the PID of the running mysql, and kill it with -9

kill -9 PID

Launch mysql with the --skip-grant-tables option.  It is recommended to also use --skip-networking so that, while you are resetting the password, no one else can connect to it remotely.

bin/mysqld_safe --skip-grant-tables --skip-networking

Log into mysql as root; this time, it will not ask you for a password (how nice!):

mysql -u root

After you are in, reset the root password, flush the privileges, and exit as follows:

MYSQL> USE mysql;
MYSQL> UPDATE user SET Password=PASSWORD("newAndMemorablePassword") WHERE User="root";
MYSQL> FLUSH PRIVILEGES;
MYSQL> EXIT;

Follow the instructions above to locate the mysql process, and kill it again.

Then finally, start mysql the normal way. If you are using the mysql that was installed using the package management tool that came with your linux distro, you may do any of the following:

sudo /etc/init.d/mysql start


OR

sudo /sbin/service mysql start

OR sometimes, service is placed in a different directory:

sudo /usr/sbin/service mysql start

However, if you installed mysql without using any package management tools, you probably have to start it manually:

sudo bin/mysqld_safe --user=mysql &

Note: according to mysql’s own instructions, there is supposed to be a safer and preferred way to reset the password, which is to create an init file and launch mysqld_safe with the --init-file option, where the content of the init file is the “UPDATE…” and “FLUSH…” statements listed above.  But this method didn’t seem to work for me.

Step by step instructions on self-signed certificate and Tomcat over SSL

Creating a self-signed certificate to test Tomcat https is easy, and this article gives you the step-by-step instructions on the following parts:

1. Create a self-signed host certificate using openSSL

2. Create a PKCS12 keystore and convert it to java keystore

3. Configure Tomcat 6 to use https, and redirect http to https

4. Create a Java client to talk to the Tomcat over SSL with the self-signed certificates

Part 1.  Create a self-signed host certificate using openSSL

There are different ways of creating a self-signed certificate, such as using Java keytool.  But I prefer openSSL because the keys and certificates generated this way are more standardized and can be used for other purposes.  The openSSL HOWTO page gives you a lot of details and other information.

1.1 Create a pair of PKI keys

PKI stands for Public Key Infrastructure, also known as an asymmetric key pair scheme, where you have a private key and a public key.  The private key is a secret you guard with your honor and life, and the public key is something you give out freely.  Messages encrypted with one can be decrypted with the other.  Ideally, given only one key, it should be infeasible to derive the other; however, an openSSL private key file actually stores all the parameters of the key pair, so given the private key you can easily derive the public key (but not vice versa, otherwise the security would be broken).  For this reason, when you generate a key using openSSL, it only gives you a private key.
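You can see this property in action with a quick round trip on the command line.  This is just a demo sketch, separate from the Tomcat setup; openssl is assumed to be on the PATH, and the demo-* file names and the message are made up:

```shell
# Generate a private key, derive its public half, then show that a message
# encrypted with the public key can be read back with the private key.
openssl genrsa -out demo-private.pem 2048
openssl rsa -in demo-private.pem -pubout -out demo-public.pem

printf 'attack at dawn' > message.txt
openssl pkeyutl -encrypt -pubin -inkey demo-public.pem \
    -in message.txt -out message.enc
openssl pkeyutl -decrypt -inkey demo-private.pem -in message.enc
# prints: attack at dawn
```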

As a side note, the word asymmetric is really a poor choice.  Once, a security expert was giving a presentation on PKI to a roomful of students, and one of his slides was supposed to have the title “Asymmetric key scheme”; perhaps it was the font he used, or perhaps he made a last-minute typo, but it looked like there was a space between the letter “A” and the rest of the word.  After that presentation, quite a few naive students came away thinking that PKI is a symmetric (WRONG!) key scheme, when it is exactly the opposite.  This is probably a less forgivable mistake than blowing up the chemistry lab because someone thought inflammable means not flammable.

1.1.1 Create a host private key using openSSL

openssl genrsa -out HOSTNAME-private.pem 2048

This private key is 2048 bits long and generated using the RSA algorithm, and we choose not to protect it with an additional passphrase because the key will be used with a server certificate.  The name of the private key is HOSTNAME-private.pem, where HOSTNAME should be replaced by the name of the machine on which you intend to host Tomcat.

1.1.2 Derive the public key using openSSL.  This step is not necessary unless you want to distribute the public key to others.

openssl rsa -in HOSTNAME-private.pem -pubout  > HOSTNAME-public.pem

1.2 Create a self-signed X509 certificate

openssl req -new -x509 -key HOSTNAME-private.pem -out HOSTNAME-certificate.pem -days 365

Then you will be prompted to enter a few pieces of information; use “.” if you wish to leave a field blank:

Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:Indiana
Locality Name (eg, city) []:Bloomington
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Cool Org
Organizational Unit Name (eg, section) []:Cool IT
Common Name (eg, YOUR name) []:Cool Node
Email Address []:.

You will now see your host certificate file HOSTNAME-certificate.pem
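Incidentally, the prompts can be skipped entirely by passing the subject on the command line with -subj.  A hypothetical sketch follows; the host and organization names are invented, and the key is generated here too so the snippet stands alone:

```shell
# Generate a key and a one-year self-signed certificate non-interactively.
openssl genrsa -out myhost-private.pem 2048
openssl req -new -x509 -key myhost-private.pem -out myhost-certificate.pem \
    -days 365 \
    -subj "/C=US/ST=Indiana/L=Bloomington/O=Cool Org/OU=Cool IT/CN=myhost.example.com"

# Inspect the result: the subject we set and the validity window.
openssl x509 -in myhost-certificate.pem -noout -subject -dates
```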

UPDATE: The field Common Name is quite important here.  It is the hostname of the machine you are trying to certify with the certificate, i.e. the name in the DNS entry corresponding to your machine’s IP.

If your machine does not have a valid DNS entry (in other words, doing an nslookup on the IP of your machine doesn’t give you anything), the host certificate probably won’t work too well for you.  If you are only doing a very minimalistic https connection using the HttpsURLConnection provided by Java, you can probably get by by disabling the certificate validation as outlined towards the end of this article; however, if you use other third-party software packages, you will probably get an exception that looks like the following:

java.io.IOException: HTTPS hostname wrong:  should be <xxx.yyy.zzz>

This is because many security packages check for things such as URL spoofing: when a reverse lookup of the machine IP does not yield the same hostname as the one in the certificate, they think something is fishy and throw the exception.

Part 2. Create a PKCS12 keystore and convert it to a Java keystore

Java keytool does not allow the direct import of x509 certificates with an existing private key, and here is a Java import key utility Agent Bob created to get around that.  However, we can still get it to work even without this utility.  The trick is to import the certificate into a PKCS12 keystore, which Java keytool also supports, and then convert it to the Java keystore format

2.1 Create a PKCS12 keystore and import (or export depending on how you look at it) the host certificate we just created

openssl pkcs12 -export -out keystore.pkcs12 -in HOSTNAME-certificate.pem -inkey HOSTNAME-private.pem

It will ask you for the export password, and it is recommended to provide a password.

2.2 Convert the PKCS12 keystore to Java keystore using Java keytool.

keytool -importkeystore -srckeystore keystore.pkcs12 -srcstoretype PKCS12 -destkeystore keystore.jks -deststoretype JKS

Keytool will first ask you for the new password for the JKS keystore twice, and it will also ask you for the password you set for the PKCS12 keystore created earlier.

Enter destination keystore password:
Re-enter new password:
Enter source keystore password:
Entry for alias 1 successfully imported.
Import command completed: 1 entries successfully imported, 0 entries failed or cancelled

It will output the number of entries successfully imported, failed, and cancelled.  If nothing went wrong, you should have another keystore file: keystore.jks

Part 3. Configure Tomcat to use HTTPS

With the keystore in place, we can now configure Tomcat to communicate via SSL using the certificate.

3.1 Configure Tomcat HTTPS Connector.

Edit CATALINA_HOME/conf/server.xml, where CATALINA_HOME is the base directory of Tomcat.  By default, the HTTPS Connector configuration is commented out.  We can search for “8443” which is the default port number for HTTPS connector, and then either replace the configuration block, or add another block just below.  We are going to use the Coyote blocking connector:

The default block, which is commented out, looks like this:

 <Connector port="8443" protocol="HTTP/1.1" SSLEnabled="true"
 maxThreads="150" scheme="https" secure="true"
 clientAuth="false" sslProtocol="TLS" />

And this is the connector block we are going to use instead:

<Connector port="8443" protocol="org.apache.coyote.http11.Http11Protocol"
 SSLEnabled="true" maxThreads="150" secure="true" scheme="https"
 keystoreFile="PATH/TO/keystore.jks" keystorePass="JKS_KEYSTORE_PASSWORD"
 clientAuth="false" sslProtocol="TLS" />

In the snippet above, PATH/TO/keystore.jks is the path to the Java keystore we created earlier, and I recommend using the absolute path to eliminate any confusion.  Also provide the keystore password; it is stored in plain text, so protect server.xml with restrictive file permissions (e.g. 600).

The Tomcat SSL configuration instructions are a bit misleading and may lead us to believe that both the blocking and non-blocking connectors should be configured.  This is not true, because a port number can only be used by one connector.

This configuration enables Tomcat to communicate over HTTPS on port 8443.  At this point, it is a good idea to fire up Tomcat and point your web browser to https://HOSTNAME:8443 to see if Tomcat’s front page shows up.  Since we are using a self-signed certificate, your browser may complain about the certificate not being trusted; accept the certificate so your browser can display the page.

3.2 Configure Tomcat to redirect HTTP to HTTPS

However, so far, Tomcat still supports plain HTTP (the default port is 8080, but it may have been changed in your setup).  It would be desirable to automatically redirect any HTTP requests over to HTTPS.  The first thing to do is to edit CATALINA_HOME/conf/server.xml again; this time, locate the Connector configuration for HTTP, and modify it so that the “redirectPort” attribute points to the HTTPS port (8443 by default).

<Connector port="8080" protocol="HTTP/1.1"
 redirectPort="8443" />

Now save server.xml, then edit web.xml and add the following block to the end of the file, just before the </web-app> tag (in other words, the security-constraint section must be added AFTER the servlet-mapping sections):

<web-app ...>
   ...
   <security-constraint>
      <web-resource-collection>
         <web-resource-name>All Apps</web-resource-name>
         <url-pattern>/*</url-pattern>
      </web-resource-collection>
      <user-data-constraint>
         <transport-guarantee>CONFIDENTIAL</transport-guarantee>
      </user-data-constraint>
   </security-constraint>
</web-app>

Save this file, restart Tomcat again. This time, open a browser and enter the URL to the normal HTTP port, and see if Tomcat redirects to the HTTPS port.

Part 4. Create a test Java client to talk to Tomcat over SSL

Since we created our own self-signed certificate, if we just use a Java HttpsURLConnection client trying to connect to the Tomcat over SSL, it will not honor the certificate and throw an exception like the following:

Exception in thread "main" javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
 at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:174)
 at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1731)
 at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:241)
 at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:235)
 at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1206)
 at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:136)
 at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:593)
 at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Handshaker.java:529)
 at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:925)
 at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1170)
 at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1197)
 at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1181)
 at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:434)
 at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:166)
 at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:133)
 at test.SClient.main(SClient.java:97)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
 at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:323)
 at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:217)
 at sun.security.validator.Validator.validate(Validator.java:218)
 at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:126)
 at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:209)
 at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:249)
 at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1185)
 ... 11 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
 at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:174)
 at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:238)
 at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:318)
 ... 17 more

To circumvent this problem, we can disable the certificate validation in the client so we can move forward with the testing.  Add the following block of code in your Java client before creating any HttpsURLConnection (credit for this block of code goes to http://www.exampledepot.com/egs/javax.net.ssl/trustall.html):

// this block of code turns off the certificate validation so the client can talk to an SSL
// server that uses a self-signed certificate
// !!!! WARNING make sure NOT to do this against a production site
// this block of code owes thanks to http://www.exampledepot.com/egs/javax.net.ssl/trustall.html

// requires these imports: javax.net.ssl.HttpsURLConnection, javax.net.ssl.SSLContext,
// javax.net.ssl.TrustManager, javax.net.ssl.X509TrustManager
TrustManager[] trustAllCerts = new TrustManager[] {
    new X509TrustManager() {
        public java.security.cert.X509Certificate[] getAcceptedIssuers() {
            return null;
        }

        public void checkClientTrusted(java.security.cert.X509Certificate[] certs, String authType) {}

        public void checkServerTrusted(java.security.cert.X509Certificate[] certs, String authType) {}
    }
};

SSLContext sslContext = SSLContext.getInstance("SSL");
sslContext.init(null, trustAllCerts, new java.security.SecureRandom());

// install the permissive socket factory as the default for HttpsURLConnection
HttpsURLConnection.setDefaultSSLSocketFactory(sslContext.getSocketFactory());

// end of block of code that turns off certificate validation
// ////////////////////////////////////////////////////////////////////////////////////