Insan-IT

Thursday, September 23, 2010

Rant: GPT versus LVM

It hit medium-sized data centres approximately 4 years ago and now it's barreling down on home PCs with the force of a locomotive. Another virus? Worm? Trojan? Not at all - I'm talking about the 2TB partition size limitation for MS-DOS MBR-style partitions! Not too many have been affected but Linux users purchasing high-end machines are starting to hear the rumblings as their favorite distro's installer craps out on them. With 4TB consumer-grade drives out on the market it's only a matter of time until the rest of us are confronted with this. My first reaction was
"Damn! I won't be able to use cfdisk and will have to settle for GNU parted! I hate GNU parted!"
...but this is much more far-reaching than just affecting your choice of partition editor.
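
For the curious, the wall falls straight out of the on-disk format: each 16-byte MBR partition entry stores its start LBA and its sector count as 32-bit fields, so with the classic 512-byte sector a partition simply can't address past 2TiB. A quick back-of-the-envelope check:

```python
# MBR partition entries store the start LBA and the sector count as
# 32-bit little-endian fields, so the largest addressable partition is
# 2^32 sectors of 512 bytes each.
MAX_SECTORS = 2**32   # largest value a 32-bit sector-count field can hold
SECTOR_SIZE = 512     # classic DOS-era sector size

limit = MAX_SECTORS * SECTOR_SIZE
print(limit)            # 2199023255552 bytes
print(limit / 2**40)    # 2.0 -- i.e. 2 TiB, the infamous wall
```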

It is, of course, not the first time the industry has faced such barriers and overcome them. The 1MB RAM barrier of Intel's 16-bit processors (worked around with 24-bit addressing on the 80286 and finally dealt with by the fully 32-bit 80386), the 8GB partition size limit imposed by CHS addressing (worked around with extended partitions and finally dealt with by LBA addressing), the Y2K date limitation (worked around with a windowing cutoff of 1930 instead of 1900, making it a Y2K30 limitation, and resolved with 4-digit years, making it a Y9K999 limitation), the 4GB RAM barrier (worked around with PAE and resolved with 64-bit processors), and the "limit" of only ~4,000,000,000 (2^32 minus reserved) 32-bit IPv4 addresses (worked around with NAT and resolved with 128-bit IPv6, which nobody's using), to name a few. As usual the pattern will be: work around it, then finally resolve it with a new standard. Whether that standard is an elegant, forward-looking "fresh start" or an inelegant back(ass)wards-compatible kludge which only delays the problem remains to be seen.

So what can we do right now? Well...we can simply create more partitions and delay the inevitable. Using MBR you can splinter your data into 4 x 2TB monsters, leaving you with a "new" theoretical limit of 8TB or, better yet, create an extended partition and have a practically unlimited number of 2TB slices at your disposal. In my opinion, that's hardly ideal. The "industry" response to this is GPT - a sloppy kludge layered on top of the old MBR structure that comes with its own "new" limitations. Essentially it "reserves" the MBR with a single partition of type 0xEE and then builds its own structure: odd (each partition has a "type" GUID so that vendors can fight over essentially meaningless identifiers - a concept carried over from the old MBR-style partitions), arbitrary (the first usable sector is 34 - why 34?), limited (128 partitions to a maximum size of 9.4ZB, or roughly 9.4 billion TB), and fixed-form (there is, and will only ever be, a V1.0 of GPT).
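
To be fair, the magic number 34 isn't pulled from thin air - it falls out of the spec's minimum layout (protective MBR at LBA 0, GPT header at LBA 1, then at least 128 partition entries of 128 bytes each), assuming 512-byte sectors. The numbers are easy to verify:

```python
SECTOR = 512        # GPT was specified in the 512-byte-sector era
ENTRIES = 128       # minimum number of partition entries per the spec
ENTRY_SIZE = 128    # bytes per partition entry

# LBA 0: protective MBR, LBA 1: GPT header, LBAs 2..33: entry array
first_usable_lba = 2 + (ENTRIES * ENTRY_SIZE) // SECTOR
print(first_usable_lba)              # 34 -- the "arbitrary" number

# Partition start/end LBAs are 64-bit, giving the oft-quoted ~9.4 ZB
max_bytes = 2**64 * SECTOR
print(round(max_bytes / 1e21, 2))    # 9.44 zettabytes
```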

Well...we get 9ZB partitions...it's supported in the Linux kernel...problem solved, right? OK...yes, in a barely- and kludgily-backwards-compatible way, but solved, right? Well, let's think here. What would be considered forward-looking and potentially superior to GPT? It's best if we look at some challenges and caveats of the MBR-style partition system:
  • Larger block sizes improve performance and increase storage capacity. Does it support block sizes other than the old DOS-style 512-bytes?
  • Sometimes partitions run out of space and we need more - it would be nice just to buy another disk and have more. Does it allow partitions to be combined seamlessly into logically contiguous blocks? On a live system in real time? If they aren't side-by-side? On different disks? On different machines?
  • We like to get the most from our hardware and losing data sucks. Does it allow for partition striping to gain performance or mirroring to gain reliability?
  • Working with live filesystems prevents certain important activities such as backups. Does it allow for instant copies of partitions on a live system to ease backups, imaging? Or for safe trials of such activities as filesystem performance tweaking or high-risk repair tools?
GPT's answer to every one of these is a resounding "NO". It was this realization which really got me thinking: doesn't LVM, as implemented in Linux, solve all of these problems?
As a matter of fact, Linux LVM is the ultimate evolution of the whole "partitioning" solution so far. While LVM is currently limited to "only" 8 exabytes (0.008 zettabytes), this is a small price to pay for all that other functionality. Additionally, and unlike GPT, the LVM format is developed in a revisable manner, so it would be trivial to raise this limit in a future revision. Why was GPT even designed when a superior solution already existed? Why was the wheel reinvented...as a triangle?

In the name of simplicity I'm hoping LVM will one day overtake GPT as a straightforward, unified method of dividing disk storage and give us all the flexibility we deserve. The main hurdle to this happening is a certain antagonist commonly named in the industry: Microsoft. I haven't been tallying, but this will be approximately the 15,000th time the industry has had to settle for mediocrity and inferiority in order to remain compatible with the mediocrity and inferiority of Microsoft's products and general incompetence. Since LVM is open-source, it would technically be easier for Microsoft (and, to a lesser degree, other vendors) to support it than to "create" a solution to this problem. The code is there for them to freely use and implement without the threat of getting sued over patents and/or copyrights. (As an aside, this is Microsoft's dirty little secret, of course, since the case is the exact opposite for Linux, standards bodies, and other tech companies supporting Microsoft's technologies and standards. After 15 years in this industry one sees Microsoft shun external standards and technologies and create their own, not so much because of classic NIH, but rather because they seem to like kicking small puppies with any mighty newfound "Intellectual Property" powers they come to possess.)

Personally, I don't need the added complexity of 2 impotent partitioning schemes (and all the related problems that pop up from time to time) and would like the possibility of simply having LVM manage my storage. Before this can happen, though, it has one major hurdle to overcome: LVM is not supported by the "standard" BIOS, since it's not part of the EFI "standard" to which we can expect almost all current and future BIOSes to adhere. This means it is neither accessible nor bootable from the BIOS, and either MBR or GPT is required at least until a real OS of some kind is loaded.

Will this be the final result? Is GPT here to stay because the "industry" has decided? In my opinion, yes, and the two will coexist. However, there are many smart data centre administrators and savvy techies out there who will add some weight by choosing a superior configuration for their machines and making it work (custom BIOSes and booting from other media come immediately to mind). Second, booting from LVM becomes a greater possibility when you consider that computer BIOSes are no longer a fixed, burnt-in permanence. They can be flashed, and superior open alternatives are available to give Phoenix, AMI, and Byosoft the kick-in-the-ass they've sorely needed for decades! Consider this a challenge to the coreboot community: help support superior solutions and make BIOSes LVM-aware!

The way I see it, the MBR 2TB partition size limit is not a new limit at all but an expansion: I can now have up to 2TB of bootstrap code in which to get a running OS to use my LVM volumes on the rest of the drive ;- )

I'm curious whether it's possible to create an MBR-style partition for /boot, an 0xEE partition to tell old-school tools to "piss off", and an LVM volume embedded directly into the hard disk block device right after the /boot partition ends.

In the meantime is anybody else using LVM instead of GPT? I'd like to hear about different approaches people have taken and how they've panned out!

12 comments:

Anonymous said...

You sure seem to like pissing into the wind!

Either way it sounded like a challenge to me so I gave it a shot...no go. Below is a copy of the post I made to the Gentoo forums:
OK...so if you create an LVM volume on a disk you can "reserve" a small amount of space by moving the initial label up to the 3rd sector:
pvcreate --labelsector 3 /dev/sda


I did a hexdump and got:
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000600 4c 41 42 45 4c 4f 4e 45 03 00 00 00 00 00 00 00 |LABELONE........|
00000610 68 6b 72 e7 20 00 00 00 4c 56 4d 32 20 30 30 31 |hkr. ...LVM2 001|
00000620 63 6c 44 51 6a 32 44 58 4b 53 36 38 47 6b 6e 61 |clDQj2DXKS68Gkna|
00000630 32 50 36 76 6a 44 32 49 59 6f 74 59 39 78 36 67 |2P6vjD2IYotY9x6g|
00000640 00 60 97 f2 1b 00 00 00 00 00 03 00 00 00 00 00 |.`..............|
00000650 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000660 00 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00 |................|
00000670 00 f0 02 00 00 00 00 00 00 00 95 f2 1b 00 00 00 |................|
00000680 00 60 02 00 00 00 00 00 00 00 00 00 00 00 00 00 |.`..............|
00000690 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 f7 a5 f9 a0 20 4c 56 4d 32 20 78 5b 35 41 25 72 |.... LVM2 x[5A%r|
00001010 30 4e 2a 3e 01 00 00 00 00 10 00 00 00 00 00 00 |0N*>............|
00001020 00 f0 02 00 00 00 00 00 00 06 00 00 00 00 00 00 |................|
00001030 a6 03 00 00 00 00 00 00 ea 36 42 05 00 00 00 00 |.........6B.....|
00001040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001200 56 47 5f 54 45 53 54 20 7b 0a 69 64 20 3d 20 22 |VG_TEST {.id = "|
00001210


So there's usable space between 0x0000 and 0x0600 (1536 bytes). There's also space between 0x0690 and 0x1000 (2416 bytes) although I doubt it's "safe" to use that for boot code.

So I just installed Grub 2 and took a look at what I needed:
/lib/grub/i386-pc/lnxboot.img = 1024 bytes
/lib/grub/i386-pc/ext2.mod = 5524 bytes
/lib/grub/i386-pc/lvm.mod = 5496 bytes


Some other interesting ones:
/lib/grub/i386-pc/ata.mod = 7972 bytes
ata.mod.gz = 3963 bytes
/lib/grub/i386-pc/gzio.mod = 7740 bytes
ext2.mod.gz = 2968 bytes
lvm.mod.gz = 2896 bytes


Definitely a no go...bummer, this was a fun mini-project. About the only steps I could take from here would be to use a slightly smaller LVM PV and embed GRUB code at the end of the disk, or create a loopback device with an offset to make space for GRUB code. I actually tried the loopback method and it works, but who cares - BOTH of these solutions are a bigger "kludge" than GPT!
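
For anyone wanting to re-check the arithmetic behind the "no go" (using the 1536 bytes freed by `--labelsector 3` and the GRUB 2 file sizes listed above):

```python
# Space available before an LVM label pushed up to sector 3
label_offset = 3 * 512
print(label_offset)            # 1536 bytes for boot code

# Smallest plausible GRUB 2 payload, sizes from the listing above
needed = 1024 + 5524 + 5496    # lnxboot.img + ext2.mod + lvm.mod
print(needed)                  # 12044 bytes
print(needed <= label_offset)  # False -- roughly 8x too big
```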

Anonymous said...

It would have to be a standard - you want something simple that all utilities can use. Besides, with LVM any running OS would have to shuffle blocks around and could potentially affect the others. So you run your Windows7istaUltimate and next thing you know your XYZ Linux won't boot. Do you really want to give Microsoft another avenue for making Linux look "broken" and incapable?

Anonymous said...

Ummm...it may be "kludgy" and all but the problem with doing it any nonstandard way is that repair tools will have no idea how to work with what you've got. Unless you design them in-house of course ;)

Surviet said...

Hello,
I'm not an expert; I came to this page while web searching about GPT and LVM.
I have some hardware to set up:
- 1 server
- 4 storage bays with 12x 300 GB SAS hard drives each

Since some services need increased storage performance, I'm interested in putting as many hard drives to work as possible to avoid I/O waits (e.g. on our mail servers).
For now I'm in the testing phase; the solution is not yet deployed.
So here's what I did:
- 2 bays are connected in unified mode to get 24 hard drives (so total storage is bigger than 2 TB).
- the server is connected to the 1st bay via a basic SAS controller.
Yes, the people who gave me this hardware didn't buy a decent controller.
The motherboard's integrated SAS controller only supports RAID0 and RAID1.
2x PCIe SAS controllers are also present but they don't have a BIOS.

I created a mirrored array (RAID1) with the integrated controller handling the 2 internal (inside the server) hard drives.
Ubuntu Server is installed on it; maybe I'll change distro later, that is not so important.
To get all 24 HDDs available as 1 storage pool, I used software RAID (mdadm) and created 1 RAID5 array with all the disks.
Then, with parted, I created a GPT partition and added LVM support.
So I planned to combine GPT and LVM.
Is that useless?
Maybe I'm wrong, but what I understood is that GPT is part of EFI, which replaces the BIOS.
So LVM is not at the same level, but at a higher (software) level.
Accepting these assertions, I think we cannot compare the two?
Apple has integrated EFI (even if a modified version) for years.
Some PC motherboards also offer EFI support.
Aren't BIOS and EFI from Intel and not Microsoft?
As far as I know (maybe wrong again), EFI is there as a modern solution to replace the BIOS and its MS-DOS compatibility limitations (4 partitions, 2 TB, etc.).
I see the goodness that comes with LVM at a higher level.
You talked about integrating LVM into GPT-supported filesystems.
As I read on Wikipedia, Linux partitions have GUIDs; isn't that sufficient?

Bake said...

Surviet,

Sorry for the delayed answer to your questions - I haven't had ANY time for the blog in months.

This post partially qualifies as a rant, but my primary point was that GPT does little to advance the problem of disk space allocation past the basic partitioning done by the DOS MBR scheme, while adding a heap of complexity which WILL contribute to Real-Fucking-Life (RFL) failures and problems. Take, for example, the fact that it requires a copy of the partition information to be stored at the END of the disk, and the implications this has for the common practice of cloning a failing disk, sector-by-sector, to a larger one. Now you need specialized crapware to do it or face problems and failures...and why? To provide backwards compatibility with DOS. Will anyone even know what "DOS" was 15 years from now...and yet we'll have this legacy GPT crap supporting its old partitioning scheme. Which is where I justify bashing Microsoft (and their lackeys, including Intel) in this post.

LVM, on the other hand, does many things to advance the disk allocation problem, so its complexity is, IMO, well warranted.

So to answer your questions directly:
1) So you've combined GPT and LVM (on your first disk only). What happens to the complexity of replacing that first disk (or array) with a larger one? Can you remove and replace it on a live system like you can with your LVM-only disks? This is what I refer to when I talk about RFL impact. What is the GPT component giving you in return for making this such a pain in the ass? I, personally, WOULD call that useless.
2) The BIOS is software code, just like LVM drivers or LVM-aware boot code is software. The BIOS just happens to be written to the chip which starts the computer. There's no practical reason a machine couldn't have LVM boot code in its BIOS. So YES, you can compare the two, because both aim to solve the disk-allocation problem - just one does it MUCH better than the other.
3) Reserved partition identifiers simply show the 1980s archaic thinking (with a 1990s GUID twist) of GPT's architects. A partition really becomes identifiable once you put metadata on it (format it) for a filesystem or other use (such as swap or suspend/resume data). The goodness of LVM extends to all levels, as it excels here too, giving you straight storage to allocate and format as you see fit without any meaningless "type IDs" to confuse things.

The only real reason you'd want GPT is to dual-boot with Windows or another system but for me that's compromising the security and rock-solid stability of the Linux/BSD installed beside it since Windows has read/write access to the whole disk when booted. IMHO, it's best to keep sloppy OSes contained using virtualization.

Now what I'd REALLY like is to be able to boot my system with LVM only...no legacy GPT garbage on it at all. Anyone who can help me with that will be a personal hero to me this year.

cmurphy said...

Well, if GPT is legacy garbage, then MBR is also legacy garbage. I agree with the general assessment that GPT doesn't go far enough in solving certain problems. But it also seems that it is solving a constrained set of problems: physical disk size, and partition size and quantity limitations. It does those three things and is supported on Mac OS, Windows, Linux, and others, whereas LVM simply isn't, even if it would be a good idea.

Also, GPT isn't a Microsoft thing; your complaint is better directed at Intel. GPT is tied to EFI (now UEFI) and the two go hand in hand.

I think you're stuck with some kind of primary partitioning scheme, either GPT or MBR, because LVM2 isn't a primary partitioning scheme, it's sub-partitioning and supra-partitioning. It has a dependency on one of these two partitioning schemes.

But if the argument is to redesign something new, it probably makes more sense to leave GPT alone and get LVM features into the filesystem, which is sort of the direction ZFS and btrfs are going in - they largely obviate the need for a separate LVM.

I don't know that sectors remain relevant at all. For some time drives have internally managed their actual sectors while externally communicating with the OS's filesystem via LBA. Since the drive handles error detection and correction itself, this makes some sense. But with ZFS and btrfs having even more robust error detection and correction, does it make more sense to give such advanced filesystems a kind of raw access to our storage devices instead?

Bake said...

cmurphy,
It's nice to get some intelligent feedback on the blog, let alone on a rant like this one. Your points are certainly valid but some warrant a closer look:
1) "...is supported on Mac OS, Windows, and Linux and others. Where as LVM simply isn't...".
This is a circular argument. My point is that LVM technology should have been leveraged to solve the 2TB MBR limit rather than developing a new kludgy "standard"...you know, before it was pushed to and supported on all those platforms. While the LVM code and drivers are Linux-specific, the on-disk format and specs are very OS-agnostic, so why this wasn't leveraged by the goofs who developed GPT is beyond me.
2) "...Also, GPT isn't a Microsoft thing, your complaint is better directed at Intel. GPT is tied to EFI (now UEFI) and go hand in hand..."
Seriously? Intel doesn't go pee without Microsoft's go-ahead.
3) "...I think you're stuck with some kind of primary partitioning scheme, either GPT or MBR, because LVM2 isn't a primary partitioning scheme, it's sub-partitioning and supra-partitioning..."
Can you define a "primary" partitioning scheme versus a "sub-primary" one? I can throw LVM on a raw disk by itself (which I assume makes it "primary" since it's the lowest-level disk format there) and it functions just fine (as tested on my Linux box). The only caveat is that it can leave only 4K of space for a boot loader, which is insufficient for GRUB2 to load with an LVM module. Other than that it works just fine.

Hope to hear a response.

cmurphy said...

Mostly I just don't understand your arguments, because you bring up so many of them - like throwing spaghetti at a wall to see what sticks. And you impugn the work of others while simultaneously failing to qualify these grand statements.

For example, GPT is not just garbage but legacy garbage, and now it's a kludge. So is it kludgy legacy garbage? It's a very simple extension, designed to fix really specific problems without getting overly complicated and carried away. We're talking about people's data - potentially extremely important data. I think you want an initial scheme that contains the most basic data for the most basic systems, because it's going to be used on mobile devices, not just servers. It doesn't need to be overly capable; it needs to be stable and not subject to routine change. Most people partition a drive once and then that's it.

This is clearly a set of choices engineers had to make between flexibility and stability. Anything more capable is more complex, and more complexity means the various partition tools must account for that complexity. If even one of them gets something wrong because of some ambiguity, now they all have to be aware of that deviation and account for it (or possibly even fix it). So I think they went with a very conservative approach that addressed the top issues with MBR and left it at that.

BIOS is to MBR as (U)EFI is to GPT. There is no LVM in that scheme at all; it's a separate thing. And Linux LVM/LVM2 is GPL'd - you will not get Microsoft or Apple adopting it. In fact, Apple has just produced their own LVM in the latest OS, and so far as I can tell it is not readable by any other OS, not even the previous version of their own. Yet they all agreed on GPT because they all agreed on (U)EFI.

So you have LVM alone on your disk, but if this is a bootable disk it must also have an MBR, because without one a BIOS-based computer will not boot. Sure, you could zero out the partition portion of the MBR, but the bulk of the MBR is not partition information - it's the bootloader. So you have an MBR even if it's non-standard/incomplete. The bootloader is loaded and stage-loads; one of those stages eats a configuration file that tells it where the LVM starts, it reads that, and away you go.
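
The breakdown of the MBR here checks out: of the 512-byte sector, only 66 bytes are partition table and signature; the rest is boot code. A sketch of the standard layout:

```python
import struct

BOOT_CODE = 446      # bytes 0..445: bootloader stage 1
TABLE = 4 * 16       # bytes 446..509: four 16-byte partition entries
SIGNATURE = 2        # bytes 510..511: the 0x55 0xAA boot signature
assert BOOT_CODE + TABLE + SIGNATURE == 512

# The "LVM-only" disk discussed above: boot code and signature present,
# partition entries zeroed out.
mbr = bytearray(512)
mbr[510:512] = struct.pack("<H", 0xAA55)  # stored on disk as 55 AA
print(mbr[446:510] == bytes(64))          # True: empty partition table
print(mbr[510:512].hex())                 # 55aa
```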

But every disk tool on the planet will tell you your drive is farkakte. So I don't know why this really matters. I can have a perfectly bootable and usable hard drive without partitioning, without LVM, and without a filesystem on the disk. All I have to do is tell the device to feed me a stream of LBAs in the blind and execute. I bet I wouldn't get all that far before I start having "OK, now what?" sorts of problems with my wonderful system.

Anyway, I don't really understand the complaint or the question or the relevance of either. We have what we have; I think GPT achieves its design goals and overcomes the limitations of MBR. I don't find convincing the argument that LVM could have fixed this, let alone in a way that would have been acceptable to the entire industry, including mobile and embedded devices, which also need partitioning schemes and more often than not don't use LVM - they don't have particularly sophisticated requirements in that regard.

So the question I have for you is: why would you subject everyone to a scheme that is substantially more capable, complex, and sophisticated than they require? Why not let the basic thing everyone has out of the gate be simple, because it just works?

Milind R said...

You don't seem to get the point of GPT!

You're being idiotic by thinking of GPT as being ON TOP of MBR. THE ONLY REASON there is a PROTECTIVE MBR on a GPT disk is so that MBR disk tools stay the heck away from it.

Also you might want to see this - apparently wiki is not accurate about block 34.

http://superuser.com/questions/297232/is-gpt-aligned-for-4k-blocks

I am unable to find any info on GPT and block sizes, but I highly doubt they would have hard-coded the block size into the standard. If they did, I would agree with you.

The partition table is stored at both beginning and end to ensure reliability - if one table is corrupted, get the other.

I do not understand what is kludgy about GPT. A kludge would mean unnecessary and badly-designed elements. Maybe it could have just adopted LVM, but I myself don't feel comfortable with dynamically resizable and movable partitions until I know all about them. I imagine many people would agree with me.
