Field notes from explorations on a big file system

Florian Ziemen and Janos Zimmermann (DKRZ)

FAQ

Why do you do this?

Because I can and it’s needed

  • I speak User and System.
  • We have limited knowledge of what is needed where
  • I encounter various ideas that don’t line up with experience

What kinds of data access do we see?

  • Starting python / … (many small IOPS)
  • Model output (continuous write, rather scattered reads)
  • Restarts (bursts of massive write / read operations)

How much output does a simulation produce?

As much as there is disk space available to the user.

  • Output can be configured in terms of frequency and variables written
  • More output = less risk of not being able to do an analysis
  • Disk space is limited by the outcome of the compute time application

How often is the data read?

On average, about twice

(read rate / write rate)
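
A minimal sketch of where that number comes from, i.e. the ratio of aggregate read to aggregate write traffic (the example values are placeholders, not the actual Levante counters):

    # Average number of times each written byte is read back,
    # estimated as total read traffic divided by total write traffic.
    def mean_reads_per_written_byte(bytes_read: float, bytes_written: float) -> float:
        return bytes_read / bytes_written

    # Placeholder values for illustration only (not the real counters):
    print(mean_reads_per_written_byte(bytes_read=2.0e15, bytes_written=1.0e15))  # -> 2.0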

And how’s the distribution?

6 months of access to nextGEMS Cycle 3 - full resolution daily output
blue = all regions | red = all time steps | white = both equally | gray = none

What’s the write speed needed?

Disk space / simulation duration
(for models doing asynchronous output)

  • nextGEMS pre-final:
    • ~ 2 PB in 30 days = 800 MB/s for a 500 node job
      (1/6 of the machine)
    • Spread across many files
    • In reality a bit more, as the output configuration caused trouble and some data was written and discarded right away.
    • Corresponds to ~5 GB/s for all of Levante
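
The numbers above are plain unit arithmetic; a minimal sketch using the figures quoted on this slide (the factor of 6 for the whole machine comes from the 500-node job being 1/6 of Levante):

    # Required sustained write rate = output volume / simulation duration,
    # using the nextGEMS pre-final figures quoted above.
    output_volume_bytes = 2e15          # ~2 PB of model output
    duration_seconds = 30 * 24 * 3600   # produced over ~30 days of wall-clock time
    nodes_in_job = 500                  # the job used ~1/6 of Levante

    job_rate = output_volume_bytes / duration_seconds
    print(f"Job-wide write rate:    {job_rate / 1e6:.0f} MB/s")      # ~770 MB/s, i.e. the ~800 MB/s quoted above
    print(f"Whole-machine estimate: {job_rate * 6 / 1e9:.1f} GB/s")  # ~4.6 GB/s, i.e. ~5 GB/s for all of Levante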

What’s the speed needed?

As much as we can get (for restart files)

  • ICON 1.25 km (900 nodes)
    • 30-60 seconds writing 13 TB into 4500 files.
    • 200-400 GB/s = 50-100 MB/s/file = 250-500 MB/s/node
  • Similar read time
  • Write happens once per hour, so the cost is negligible.
  • Read once every ~6 hours.
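
The bandwidth figures follow directly from the restart size and write window; a minimal sketch using the numbers quoted above:

    # Restart burst bandwidth for ICON at 1.25 km:
    # 13 TB into 4500 files from 900 nodes within 30-60 seconds.
    restart_size_bytes = 13e12
    n_files = 4500
    n_nodes = 900

    for seconds in (30, 60):
        total = restart_size_bytes / seconds
        print(f"{seconds:2d} s: {total / 1e9:3.0f} GB/s total, "
              f"{total / n_files / 1e6:3.0f} MB/s per file, "
              f"{total / n_nodes / 1e6:3.0f} MB/s per node")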

Anything you omitted?

Yes, sure: post-processing, for example

  • Some models don’t use async output (yet).
  • Probably still a good bit of serial output from rank 0 on the system.
  • Some models use MPI-IO (ICON can, but usually doesn't, as far as I know).

So what are the real numbers?

Write rates per day and per hour

Based on one- or two-week write speed data kindly provided by Carsten Beyer. Includes /scratch/ and /work/.

How can we estimate needs?

Back-of-the-envelope calculations

The data lake

I hide post-processing in the data flow for simplicity

An estimate based on archiving speed

40 drives, half of them for writing, 300 MB/s per drive
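
A minimal sketch of that arithmetic (reading the drives as tape drives and the result as a cap on sustained writes is my interpretation of the slide, not stated on it):

    # If all data eventually has to pass through the archive, sustained
    # file-system writes cannot exceed the archiving rate.
    # Slide figures: 40 drives, half of them writing, 300 MB/s each.
    n_drives = 40
    writing_fraction = 0.5
    drive_rate_bytes_per_s = 300e6

    archive_write_rate = n_drives * writing_fraction * drive_rate_bytes_per_s
    print(f"Sustained archiving write rate: {archive_write_rate / 1e9:.0f} GB/s")  # ~6 GB/s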

An estimate based on turnover

Off by a factor of two, but still not too bad; we probably need to account more strongly for data that is re-processed or that turns out to be garbage.

How about reading?

Observed Lustre total traffic rates

Binning of one-minute traffic data for one- or two-week intervals kindly provided by Carsten Beyer. Includes /scratch/ and /work/.

And then there was late December

Green = Lustre read, yellow = Lustre write (work and scratch)

Thanks to Carsten Beyer for pointing this out.

And then there was late December

Binning of one-minute traffic data for one- or two-week intervals kindly provided by Carsten Beyer. Includes /scratch/ and /work/.

The culprit…

(and a bunch of similar jobs on the other GPU nodes)

See Pay’s Tech Talk on Jan 16 for details on ClusterCockpit.

Read traffic by day and hour

Based on one- or two-week traffic data kindly provided by Carsten Beyer. Includes /scratch/ and /work/.

Other special behaviors?

Users requesting tons of files from one node

  • Led to Lustre flooding the network with data, causing traffic jams that slowed down MPI communication.
  • Mitigated by separating Lustre and MPI traffic onto separate virtual IB lanes (thanks, JFE!)
  • Further mitigation by collecting post-processing nodes in one rack.

Other special behaviors?

Users compensating for bad access patterns and file formats (GRIB) by throwing many nodes at the problem.

  • Led to massive traffic (up to 200 GB/s from 20 nodes).
  • Partly caused by badly arranged loops.
  • Mitigated by contacting and restricting individual users.

Other special behaviors?

Cache misses and hits

Thanks to Natalija for pointing this gem out!

What do we learn from this?

Write speed

  • For most of the storage, size, not speed, is the issue
    (as long as you can flush the system within a few months)
  • For restarts, speed matters, but most of them are read 0-1 times and then deleted.

Read speed

  • Generally, speed per file does not matter too much (~1 GB/s per file is fine in most cases)
  • Strong diurnal cycle in usage
    -> Read is mostly from (semi-interactive) post-processing.
  • We need to figure out what ML needs
  • We need a crystal ball to figure out how ML will develop
  • We need a system that's robust against users accidentally wreaking havoc.