1. Characteristics of large data centers (such as Google's): - Component failures are the norm
rather than the exception. Constant monitoring, error detection, fault tolerance, and automatic
recovery must be integral to the system
- Files are huge by traditional standards
- Most files are mutated by appending new data rather than overwriting existing data. Random
writes within a file are practically non-existent. The system must efficiently implement
well-defined semantics for multiple clients that concurrently append to the same file.
- Co-designing the applications and the file system API benefits the overall system by
increasing flexibility
- High sustained bandwidth is more important than low latency.
2. Google File System interface: - Files are organized hierarchically in directories and identified
by pathnames
- operations to create, delete, open, close, read, and write files
- snapshot and record append operations (record append lets multiple clients append data to the
same file concurrently)
3. Google File System Architecture: - a single master and multiple chunk servers, accessed by
multiple clients
- each machine is typically a commodity Linux machine running a user-level server process
- Files are divided into fixed-size chunks of 64 MB, each identified by an immutable 64-bit
chunk handle. Each chunk is replicated on multiple (by default 3) chunk servers
- The master maintains all file system metadata
4. Examples of GFS system metadata: the namespace, access control information, the mapping
from files to chunks, and the current locations of chunks; the master also uses this metadata
to drive chunk lease management, garbage collection of orphaned chunks, and chunk migration
between chunk servers
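To make items 3-4 concrete, here is a minimal Python sketch of the master's metadata tables. All class and field names are hypothetical illustrations, not the real GFS implementation:

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

@dataclass
class ChunkInfo:
    handle: int                                     # immutable 64-bit chunk handle
    version: int = 1                                # bumped to detect stale replicas
    locations: list = field(default_factory=list)   # chunk servers holding a replica

@dataclass
class MasterMetadata:
    namespace: dict = field(default_factory=dict)       # pathname -> access control info
    file_to_chunks: dict = field(default_factory=dict)  # pathname -> ordered chunk handles
    chunk_table: dict = field(default_factory=dict)     # handle -> ChunkInfo (locations are
                                                        # not persisted; rebuilt at startup)

meta = MasterMetadata()
meta.namespace["/logs/web.0"] = {"owner": "crawler"}
meta.file_to_chunks["/logs/web.0"] = [0xA1]
meta.chunk_table[0xA1] = ChunkInfo(handle=0xA1, locations=["cs1", "cs2", "cs3"])
```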
5. The master periodically communicates with each chunk server in ______ to
give it instructions and collect its state.: The master periodically communicates with each chunk
server in HeartBeat messages to give it instructions and collect its state.
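A hedged sketch of one HeartBeat round, assuming invented message fields: the chunk server reports its state, and the master piggybacks instructions (here, a hypothetical garbage-collection order) on the reply:

```python
def heartbeat(server_id, held_chunks, send_to_master):
    """One HeartBeat round: report this server's state, apply instructions.

    `send_to_master` stands in for the RPC to the master; every message
    field here is hypothetical, chosen only to illustrate the exchange.
    """
    report = {"server": server_id, "chunks": sorted(held_chunks)}
    reply = send_to_master(report)             # master collects the state...
    for handle in reply.get("delete", []):     # ...and piggybacks instructions,
        held_chunks.discard(handle)            # e.g. drop garbage-collected chunks
    return reply

# Example round against a stand-in master that orders chunk 0xB2 deleted:
held = {0xA1, 0xB2}
heartbeat("cs1", held, lambda report: {"delete": [0xB2]})
assert held == {0xA1}
```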
6. Clients interact with the ______ for metadata operations, but all data-bearing
communication goes directly to ______.: Clients interact with the master for metadata
operations, but all data-bearing communication goes directly to the chunk servers.
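The control/data split in item 6 looks roughly like this from the client side: one small metadata RPC to the master, then a bulk read straight from a replica. The stub classes and `client_read` helper are assumptions for illustration:

```python
CHUNK_SIZE = 64 * 1024 * 1024

class StubMaster:
    """Stand-in for the master: answers metadata lookups only."""
    def lookup(self, path, chunk_index):
        return 0xA1, ["cs1", "cs2", "cs3"]   # chunk handle and replica locations

class StubChunkServer:
    """Stand-in for a chunk server: serves the actual bytes."""
    def read_chunk(self, handle, offset_in_chunk, length):
        return b"\x00" * length

def client_read(path, offset, length, master, chunkservers):
    chunk_index = offset // CHUNK_SIZE                    # which chunk holds the offset
    handle, locations = master.lookup(path, chunk_index)  # small metadata-only RPC
    replica = chunkservers[locations[0]]                  # then go straight to a replica
    return replica.read_chunk(handle, offset % CHUNK_SIZE, length)

data = client_read("/logs/web.0", 70 * 1024**2, 4096,
                   StubMaster(), {"cs1": StubChunkServer()})
```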
7. In GFS, why does neither the client nor the chunk server cache file data?: - most applications
stream through huge files or have working sets too large to be cached
- eliminating caches avoids cache coherence issues
- chunks are stored as local files at chunk servers, so Linux's buffer cache already keeps
frequently accessed data in memory
8. The Master's operations include?: - all namespace operations
- managing chunk replicas throughout the system
- making chunk placement decisions
- creating new chunks and hence replicas
- coordinating various system-wide activities to keep chunks fully replicated
- balancing load across all the chunk servers
- reclaiming unused storage
9. GFS chunk size and its advantages: The chunk size is 64 MB, which is much larger than typical
file system block sizes. A large chunk size has the following advantages:
- it reduces clients' need to interact with the master, because reads and writes on the same
chunk require only one initial request to the master for chunk location information (see the
back-of-the-envelope sketch after this list)
- since a client is more likely to perform many operations on a given chunk, it can reduce
network overhead by keeping a persistent TCP connection to the chunk server over an extended
period of time
- it reduces the size of the metadata stored on the master
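The sketch promised in the first bullet above, assuming a 1 GB sequential read: 64 MB chunks need 16 location lookups where 4 KB blocks would need 262,144:

```python
FILE_SIZE = 1 * 1024**3                 # assume a 1 GB sequential read

def location_lookups(block_size):
    # one chunk-location request per block/chunk touched (ceiling division)
    return -(-FILE_SIZE // block_size)

print(location_lookups(64 * 1024**2))   # 64 MB chunks -> 16 lookups
print(location_lookups(4 * 1024))       # 4 KB blocks  -> 262144 lookups
```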
10. GFS chunk size and its disadvantages: - If a popular file is small (a small number of chunks),
the chunk servers storing those chunks can become hot spots (many clients accessing them)
11. How to solve hot spot issues in GFS?: - increase the replication factor for such files
- stagger application start times
12. How does GFS handle metadata?: - 3 types: the file and chunk namespaces, the mapping from
files to chunks, and the locations of each chunk's replicas
- All metadata is kept in the master's memory. The namespaces and file-to-chunk mapping are
also kept persistent via an operation log stored on the master's local disk and replicated on
remote machines
- Chunk locations are not persisted: the master asks each chunk server about its chunks at
master startup and whenever a chunk server joins the cluster
13. Why is putting metadata in the Master server's memory not a risk in practice?:
- the master maintains less than 64 bytes of metadata for each 64 MB chunk (see the worked
example below)
- the file namespace data typically requires less than 64 bytes per file because it stores file
names compactly using prefix compression
- it's not expensive or complicated to add more memory to the Master
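The worked example promised in item 13, under an assumed cluster size: a full petabyte of file data costs the master only about 1 GB of chunk metadata:

```python
CHUNK_SIZE = 64 * 1024**2     # 64 MB chunks
META_PER_CHUNK = 64           # < 64 bytes of metadata per chunk (upper bound)

data_stored = 1 * 1024**5     # assume 1 PB of file data in the cluster
num_chunks = data_stored // CHUNK_SIZE
print(num_chunks * META_PER_CHUNK / 1024**3, "GB of chunk metadata")  # -> 1.0
```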
14. Regarding the operation log, we must do this or we will effectively lose the whole file system
or recent client operations even if the chunks themselves survive: We must store the operation
log reliably and not make changes visible to clients until the metadata changes are made
persistent. Specifically, we store the log on multiple machines and respond to a client
operation only after flushing the corresponding log record to disk both locally and remotely.
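Item 14 is essentially write-ahead logging. A minimal sketch of the ordering, assuming a toy text record format and file-like log replicas:

```python
import os

def mutate_metadata(record: str, log_files, state: dict) -> str:
    """Flush the log record to every replica, then apply, then acknowledge.

    `log_files` are open text files standing in for the local disk log and
    its remote replicas; the "path chunks" record format is invented.
    """
    for f in log_files:
        f.write(record + "\n")   # append the log record...
        f.flush()
        os.fsync(f.fileno())     # ...and force it to stable storage
    path, chunks = record.split(" ", 1)
    state[path] = chunks         # only now make the change visible
    return "OK"                  # only now respond to the client
```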
15. The master recovers its file system state by?: replaying the operation log
16. The master checkpoints its state when?: the log grows beyond a certain size
17. The Master's checkpoint is in a ______ form that can
be directly mapped into memory and used for namespace lookup without ______.: The Master's
checkpoint is in a compact B-tree-like form that can be directly mapped into memory and used
for namespace lookup without extra parsing.
18. Master recovery needs?: - the latest complete checkpoint
- subsequent log files
19. A failure during checkpointing does not affect correctness because?: the recovery code
detects and skips incomplete checkpoints
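Items 15-19 combine into one recovery path. A condensed sketch, where `checkpoints`, the `complete` flag, and `replay_log_after` are hypothetical stand-ins for however GFS detects and reads these on disk:

```python
def recover(checkpoints, replay_log_after):
    """Rebuild master state: newest complete checkpoint plus log replay.

    `checkpoints` is newest-first; `replay_log_after(cp)` yields the log
    records written after checkpoint `cp` (both are hypothetical stand-ins).
    """
    latest = next((cp for cp in checkpoints if cp["complete"]), None)
    state = dict(latest["state"]) if latest else {}   # skip incomplete checkpoints:
                                                      # a crash mid-checkpoint is harmless
    for path, chunks in replay_log_after(latest):     # replay subsequent log records
        state[path] = chunks
    return state

# A crash left the newest checkpoint incomplete; recovery falls back safely:
cps = [{"complete": False, "state": {}},
       {"complete": True, "state": {"/logs/web.0": [0xA1]}}]
state = recover(cps, lambda cp: [("/logs/web.1", [0xB2])])
assert state == {"/logs/web.0": [0xA1], "/logs/web.1": [0xB2]}
```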
20. GFS data mutations: writes or record appends
21. GFS record append: A record append causes data (the "record") to be appended atomically
at least once even in the presence of concurrent mutations, but at an offset of GFS's choosing
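A toy model of item 21's contract: the offset is chosen by the system, not the client, and serialized appends give concurrent clients distinct offsets. Client retries after a failure can duplicate records on some replicas, which is why the guarantee is at-least-once rather than exactly-once. The class below is a single-replica simplification:

```python
class ToyChunk:
    """Illustrative single chunk; real GFS replicates this across servers."""
    def __init__(self):
        self.data = bytearray()

    def record_append(self, record: bytes) -> int:
        offset = len(self.data)      # the system, not the client, picks the offset
        self.data += record
        return offset                # returned so the client can find its record

chunk = ToyChunk()
off_a = chunk.record_append(b"client-A-record")   # concurrent appenders get
off_b = chunk.record_append(b"client-B-record")   # distinct, serialized offsets
assert off_a != off_b
```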
22. The state of a GFS file region after a data mutation depends on?: - the type of mutation
- whether it succeeds or fails
- whether there are concurrent mutations
23. A GFS file region is consistent if?: all clients will always see the same data, regardless of
which replicas they read from
24. A GFS region is defined after a file data mutation if?: it is consistent and clients will see
what the mutation writes in its entirety
25. Concurrent successful mutations leave the GFS region undefined but consistent when?: all
clients see the same data, but it may not reflect what any one mutation has written.
Typically, it consists of mingled fragments from multiple mutations
26. A failed mutation makes the region inconsistent (hence also undefined) when?: different
clients may see different data at different times
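Items 23-26 can be collapsed into a small decision function over the two axes from item 22. This sketch covers ordinary writes only (record appends have their own "defined interspersed with inconsistent" outcome); the state names follow the paper:

```python
def region_state(success: bool, concurrent: bool) -> str:
    """Classify a file region after a write mutation (GFS consistency model).

    consistent: all clients see the same data on every replica
    defined:    consistent AND clients see the mutation's writes in full
    """
    if not success:
        return "inconsistent (hence undefined)"   # clients may see different data
    if concurrent:
        return "consistent but undefined"         # same bytes, mingled fragments
    return "defined"                              # serial success: writes intact

print(region_state(success=True, concurrent=False))   # -> defined
print(region_state(success=True, concurrent=True))    # -> consistent but undefined
print(region_state(success=False, concurrent=False))  # -> inconsistent (hence undefined)
```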
27. After a sequence of successful mutations, the mutated file region is guaranteed to be defined
and contain the data written by the last mutation. GFS achieves this by?: - applying mutations to a
chunk in the same order on all its replicas
- using chunk version numbers to detect any replica that has become stale because it has
missed mutations while its chunk server was down. Stale replicas will never be involved in a
mutation or given to clients
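A minimal sketch of the version-number check from item 27, with hypothetical names: any replica whose version lags the master's is stale and is excluded from mutations and client reads:

```python
def fresh_replicas(master_version: int, replica_versions: dict) -> list:
    """Return servers whose replica matches the master's chunk version.

    A replica that missed mutations while its server was down keeps the
    old version number, is flagged stale, and is never handed to clients.
    """
    return [server for server, v in replica_versions.items()
            if v == master_version]

# cs2 was down during a mutation, so its version lags and it is excluded:
print(fresh_replicas(7, {"cs1": 7, "cs2": 6, "cs3": 7}))  # -> ['cs1', 'cs3']
```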
28. Since clients cache chunk locations, they may read from a stale replica before that information
is refreshed. How does GFS solve this problem?: - the cache window is limited by the cache
entry's timeout and by the next open of the file, which purges all chunk information for that
file from the cache
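Item 28's mitigation can be modeled as a client-side cache whose entries carry a deadline and are purged on the next open of the file. The TTL value and structure below are assumptions:

```python
import time

class ChunkLocationCache:
    """Client-side cache of chunk locations, bounded by a timeout and
    purged on the next open of the file (the TTL here is illustrative)."""
    TTL_SECONDS = 60.0

    def __init__(self):
        self._entries = {}   # (path, chunk_index) -> (locations, expiry)

    def put(self, path, chunk_index, locations):
        self._entries[(path, chunk_index)] = (locations,
                                              time.time() + self.TTL_SECONDS)

    def get(self, path, chunk_index):
        entry = self._entries.get((path, chunk_index))
        if entry is None or time.time() >= entry[1]:
            return None          # missing or expired: re-ask the master
        return entry[0]

    def purge_file(self, path):
        """Called on open(): drop all chunk info cached for the file."""
        self._entries = {k: v for k, v in self._entries.items()
                         if k[0] != path}
```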