Understanding disk usage in Linux


Understanding disk usage in Linux” is a well written in-depth look into the Linux filesystem layer and how things work under the hood.  This is probably not something most people would have to deal on a day-to-day basis, but it is very useful for anyone doing system administration and looking for the better understanding of operating systems.

Getting the best performance out of Amazon EFS


Jeff Geerling shares his tips for “Getting the best performance out of Amazon EFS”.  Given how (still) new the Amazon EFS is and how limited is the documentation of the best practices, this stuff is golden.

tl;dr: EFS is NFS. Networked file systems have inherent tradeoffs over local filesystem access—EFS doesn’t change that. Don’t expect the moon, benchmark and monitor it, and you’ll do fine.

Amazon Snowmobile – a truck with up to 100 Petabytes of storage


Back in my college days, I had a professor who frequently used Andrew Tanenbaum‘s quote in the networking class:

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

I guess he wasn’t the only one, as during this year’s Amazon re:Invent 2016 conference, the company announced, among other things, a AWS Snowmobile:

Moving large amounts of on-premises data to the cloud as part of a migration effort is still more challenging than it should be! Even with high-end connections, moving petabytes or exabytes of film vaults, financial records, satellite imagery, or scientific data across the Internet can take years or decades. On the business side, adding new networking or better connectivity to data centers that are scheduled to be decommissioned after a migration is expensive and hard to justify.

[…]

In order to meet the needs of these customers, we are launching Snowmobile today. This secure data truck stores up to 100 PB of data and can help you to move exabytes to AWS in a matter of weeks (you can get more than one if necessary). Designed to meet the needs of our customers in the financial services, media & entertainment, scientific, and other industries, Snowmobile attaches to your network and appears as a local, NFS-mounted volume. You can use your existing backup and archiving tools to fill it up with data destined for Amazon Simple Storage Service (S3) or Amazon Glacier.

Thanks to this VentureBeat page, we even have a picture of the monster:

aws-snowmobile

100 Petabytes on wheels!

I know, I know, it looks like a regular truck with a shipping container on it.  But I’m pretty sure it’s VERY different from the inside.  With all that storage, networking, power, and cooling needed, it would be awesome to take a pick into this thing.

 

 

Amazon Elastic File System


Here are some great news from the Amazon AWS blog – the announcement of the Elastic File System (EFS):

EFS lets you create POSIX-compliant file systems and attach them to one or more of your EC2 instances via NFS. The file system grows and shrinks as necessary (there’s no fixed upper limit and you can grow to petabyte scale) and you don’t pre-provision storage space or bandwidth. You pay only for the storage that you use.

EFS protects your data by storing copies of your files, directories, links, and metadata in multiple Availability Zones.

In order to provide the performance needed to support large file systems accessed by multiple clients simultaneously,Elastic File System performance scales with storage (I’ll say more about this later).

I think this might have been the most requested feature/service from Amazon AWS since EC2 launch.  Sure, one could have built an NFS file server before, but with the variety of storage options, availability zones, and the dynamic nature of the cloud setup itself, that was quite a challenge.  Now – all that and more in just a few clicks.

Thank you Amazon!

 

NAS Performance: NFS vs Samba vs GlusterFS


I came across this question and also found the results of the benchmarks somewhat surprising.

  • GlusterFS replicated 2: 32-35 seconds, high CPU load
  • GlusterFS single: 14-16 seconds, high CPU load
  • GlusterFS + NFS client: 16-19 seconds, high CPU load
  • NFS kernel server + NFS client (sync): 32-36 seconds, very low CPU load
  • NFS kernel server + NFS client (async): 3-4 seconds, very low CPU load
  • Samba: 4-7 seconds, medium CPU load
  • Direct disk: < 1 second

The post is from 2012, so I’m curious if this is still accurate. Has anybody tried this? Can confirm or otherwise?

Also, an interesting note from the answer to the above:

From what I’ve seen after a couple of packet captures, the SMB protocol can be chatty, but the latest version of Samba implements SMB2 which can both issue multiple commands with one packet, and issue multiple commands while waiting for an ACK from the last command to come back. This has vastly improved its speed, at least in my experience, and I know I was shocked the first time I saw the speed difference too – Troubleshooting Network Speeds — The Age Old Inquiry

 

Amazon EFS preview


Amazon EFS

Amazon Elastic File System, or EFS for short, is the missing piece of the cloud puzzle.  With all those EC2 instances, elastic load balances and IAM roles, one would often need a shared file system.  Until now, you’d either be using either an S3-based solution, which scales well in terms of price and storage, but lacks in common tools support and sometimes in real-time synchronization; or an EBS-based solution, which performs way better (especially with SSD-backed storage) and works like a regular file system, but is a bit more pricey and lacking, being a block-level solution, the sharing option – so you’d have to build something like a GlusterFS solution or an NFS server, both of which have their own issues.

So, the arrival of the EFS, even as a preview for now, will bring joy to many.

Amazon EFS is a new fully-managed service that makes it easy to set up and scale shared file storage in the AWS Cloud. Amazon EFS supports NFSv4, and is designed to be highly available and durable. Amazon EFS can support thousands of concurrent EC2 client connections with consistent performance, making it ideal for a wide range of use cases, including content repositories, development environments, and home directories, as well as big data applications that require on-demand scaling of file system capacity and performance.

(Quote from the webinar pitch)

In terms of integration, it looks easy for the Linux crowd – NFSv4 option is there.  What’s happening in the Windows world, I’m not that aware though.  Gladly, that’s not my problem to worry.

In terms of pricing, this looks a bit expensive.  The calculations are in GB-Months, with the current price being $0.30 per GB-Month.  An example for 150 GB used over the first two weeks of the month and 250 GB sued over the second half of the month, yields a 177 GB-Month average at a cost of $53.10 USD.  Even knowing that EFS is riding on SSD-based hardware and should be quite fast, the price is high.  Amazon is known however for its regular price reductions.

So for now, I’d wait.  It’s good to know that the option is there (or almost there, preview still pending).  But for the masses to jump onto it, it’ll need to calm down its dollar hunger a bit.

πfs – the data-free filesystem!


πfs – the data-free filesystem!

πfs is a revolutionary new file system that, instead of wasting space storing your data on your hard drive, stores your data in π! You’ll never run out of space again – π holds every file that could possibly exist! They said 100% compression was impossible? You’re looking at it!

Scaling the Facebook data warehouse to 300 PB


Scaling the Facebook data warehouse to 300 PB

At Facebook, we have unique storage scalability challenges when it comes to our data warehouse. Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB. In the last year, the warehouse has seen a 3x growth in the amount of data stored. Given this growth trajectory, storage efficiency is and will continue to be a focus for our warehouse infrastructure.