Browse Source

Squashed 'src/leveldb/' changes from a31c8aa40..196962ff0

196962ff0 Add AcceleratedCRC32C to port_win.h
1bdf1c34c Merge upstream LevelDB v1.20
d31721eb0 Merge #17: Fixed file sharing errors
fecd44902 Fixed file sharing error in Win32Env::GetFileSize(), Win32SequentialFile::_Init(), Win32RandomAccessFile::_Init() Fixed error checking in Win32SequentialFile::_Init()
5b7510f1b Merge #14: Merge upstream LevelDB 1.19
0d969fd57 Merge #16: [LevelDB] Do no crash if filesystem can't fsync
c8c029b5b [LevelDB] Do no crash if filesystem can't fsync
a53934a3a Increase leveldb version to 1.20.
f3f139737 Separate Env tests from PosixEnv tests.
eb4f0972f leveldb: Fix compilation warnings in port_posix_sse.cc on x86 (32-bit).
d0883b600 Fixed path to doc file: index.md.
7fa20948d Convert documentation to markdown.
ea175e28f Implement support for Intel crc32 instruction (SSE 4.2)
95cd743e5 Including <limits> for std::numeric_limits.
646c3588d Limit the number of read-only files the POSIX Env will have open.
d40bc3fa5 Merge #13: Typo
ebbd772d3 Typo
a2fb086d0 Add option for max file size. The currend hard-coded value of 2M is inefficient in colossus.

git-subtree-dir: src/leveldb
git-subtree-split: 196962ff01c39b4705d8117df5c3f8c205349950
pull/3/head
Pieter Wuille 5 years ago
parent
commit
cf44e4ca77
  1. 12
      Makefile
  2. 25
      README.md
  3. 30
      build_detect_platform
  4. 38
      db/db_bench.cc
  5. 1
      db/db_impl.cc
  6. 2
      db/log_format.h
  7. 57
      db/version_set.cc
  8. 2
      db/version_set.h
  9. 89
      doc/doc.css
  10. 213
      doc/impl.html
  11. 170
      doc/impl.md
  12. 549
      doc/index.html
  13. 523
      doc/index.md
  14. 75
      doc/log_format.md
  15. 75
      doc/log_format.txt
  16. 107
      doc/table_format.md
  17. 104
      doc/table_format.txt
  18. 2
      include/leveldb/db.h
  19. 12
      include/leveldb/options.h
  20. 6
      port/port_example.h
  21. 2
      port/port_posix.h
  22. 129
      port/port_posix_sse.cc
  23. 2
      port/port_win.h
  24. 2
      table/filter_block.cc
  25. 18
      util/crc32c.cc
  26. 195
      util/env_posix.cc
  27. 66
      util/env_posix_test.cc
  28. 28
      util/env_posix_test_helper.h
  29. 18
      util/env_test.cc
  30. 12
      util/env_win.cc
  31. 1
      util/options.cc

12
Makefile

@ -44,6 +44,7 @@ TESTS = \ @@ -44,6 +44,7 @@ TESTS = \
util/cache_test \
util/coding_test \
util/crc32c_test \
util/env_posix_test \
util/env_test \
util/hash_test
@ -121,7 +122,7 @@ SHARED_MEMENVLIB = $(SHARED_OUTDIR)/libmemenv.a @@ -121,7 +122,7 @@ SHARED_MEMENVLIB = $(SHARED_OUTDIR)/libmemenv.a
else
# Update db.h if you change these.
SHARED_VERSION_MAJOR = 1
SHARED_VERSION_MINOR = 19
SHARED_VERSION_MINOR = 20
SHARED_LIB1 = libleveldb.$(PLATFORM_SHARED_EXT)
SHARED_LIB2 = $(SHARED_LIB1).$(SHARED_VERSION_MAJOR)
SHARED_LIB3 = $(SHARED_LIB1).$(SHARED_VERSION_MAJOR).$(SHARED_VERSION_MINOR)
@ -337,6 +338,9 @@ $(STATIC_OUTDIR)/db_test:db/db_test.cc $(STATIC_LIBOBJECTS) $(TESTHARNESS) @@ -337,6 +338,9 @@ $(STATIC_OUTDIR)/db_test:db/db_test.cc $(STATIC_LIBOBJECTS) $(TESTHARNESS)
$(STATIC_OUTDIR)/dbformat_test:db/dbformat_test.cc $(STATIC_LIBOBJECTS) $(TESTHARNESS)
$(CXX) $(LDFLAGS) $(CXXFLAGS) db/dbformat_test.cc $(STATIC_LIBOBJECTS) $(TESTHARNESS) -o $@ $(LIBS)
$(STATIC_OUTDIR)/env_posix_test:util/env_posix_test.cc $(STATIC_LIBOBJECTS) $(TESTHARNESS)
$(CXX) $(LDFLAGS) $(CXXFLAGS) util/env_posix_test.cc $(STATIC_LIBOBJECTS) $(TESTHARNESS) -o $@ $(LIBS)
$(STATIC_OUTDIR)/env_test:util/env_test.cc $(STATIC_LIBOBJECTS) $(TESTHARNESS)
$(CXX) $(LDFLAGS) $(CXXFLAGS) util/env_test.cc $(STATIC_LIBOBJECTS) $(TESTHARNESS) -o $@ $(LIBS)
@ -412,3 +416,9 @@ $(SHARED_OUTDIR)/%.o: %.cc @@ -412,3 +416,9 @@ $(SHARED_OUTDIR)/%.o: %.cc
$(SHARED_OUTDIR)/%.o: %.c
$(CC) $(CFLAGS) $(PLATFORM_SHARED_CFLAGS) -c $< -o $@
$(STATIC_OUTDIR)/port/port_posix_sse.o: port/port_posix_sse.cc
$(CXX) $(CXXFLAGS) $(PLATFORM_SSEFLAGS) -c $< -o $@
$(SHARED_OUTDIR)/port/port_posix_sse.o: port/port_posix_sse.cc
$(CXX) $(CXXFLAGS) $(PLATFORM_SHARED_CFLAGS) $(PLATFORM_SSEFLAGS) -c $< -o $@

25
README.md

@ -16,7 +16,7 @@ Authors: Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com) @@ -16,7 +16,7 @@ Authors: Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)
* External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.
# Documentation
[LevelDB library documentation](https://rawgit.com/google/leveldb/master/doc/index.html) is online and bundled with the source code.
[LevelDB library documentation](https://github.com/google/leveldb/blob/master/doc/index.md) is online and bundled with the source code.
# Limitations
@ -113,29 +113,30 @@ by the one or two disk seeks needed to fetch the data from disk. @@ -113,29 +113,30 @@ by the one or two disk seeks needed to fetch the data from disk.
Write performance will be mostly unaffected by whether or not the
working set fits in memory.
readrandom : 16.677 micros/op; (approximately 60,000 reads per second)
readseq : 0.476 micros/op; 232.3 MB/s
readreverse : 0.724 micros/op; 152.9 MB/s
readrandom : 16.677 micros/op; (approximately 60,000 reads per second)
readseq : 0.476 micros/op; 232.3 MB/s
readreverse : 0.724 micros/op; 152.9 MB/s
LevelDB compacts its underlying storage data in the background to
improve read performance. The results listed above were done
immediately after a lot of random writes. The results after
compactions (which are usually triggered automatically) are better.
readrandom : 11.602 micros/op; (approximately 85,000 reads per second)
readseq : 0.423 micros/op; 261.8 MB/s
readreverse : 0.663 micros/op; 166.9 MB/s
readrandom : 11.602 micros/op; (approximately 85,000 reads per second)
readseq : 0.423 micros/op; 261.8 MB/s
readreverse : 0.663 micros/op; 166.9 MB/s
Some of the high cost of reads comes from repeated decompression of blocks
read from disk. If we supply enough cache to the leveldb so it can hold the
uncompressed blocks in memory, the read performance improves again:
readrandom : 9.775 micros/op; (approximately 100,000 reads per second before compaction)
readrandom : 5.215 micros/op; (approximately 190,000 reads per second after compaction)
readrandom : 9.775 micros/op; (approximately 100,000 reads per second before compaction)
readrandom : 5.215 micros/op; (approximately 190,000 reads per second after compaction)
## Repository contents
See doc/index.html for more explanation. See doc/impl.html for a brief overview of the implementation.
See [doc/index.md](doc/index.md) for more explanation. See
[doc/impl.md](doc/impl.md) for a brief overview of the implementation.
The public interface is in include/*.h. Callers should not include or
rely on the details of any other header files in this package. Those
@ -148,7 +149,7 @@ Guide to header files: @@ -148,7 +149,7 @@ Guide to header files:
* **include/options.h**: Control over the behavior of an entire database,
and also control over the behavior of individual reads and writes.
* **include/comparator.h**: Abstraction for user-specified comparison function.
* **include/comparator.h**: Abstraction for user-specified comparison function.
If you want just bytewise comparison of keys, you can use the default
comparator, but clients can write their own comparator implementations if they
want custom ordering (e.g. to handle different character encodings, etc.)
@ -165,7 +166,7 @@ length into some other byte array. @@ -165,7 +166,7 @@ length into some other byte array.
* **include/status.h**: Status is returned from many of the public interfaces
and is used to report success and various kinds of errors.
* **include/env.h**:
* **include/env.h**:
Abstraction of the OS environment. A posix implementation of this interface is
in util/env_posix.cc

30
build_detect_platform

@ -63,6 +63,7 @@ PLATFORM_SHARED_EXT="so" @@ -63,6 +63,7 @@ PLATFORM_SHARED_EXT="so"
PLATFORM_SHARED_LDFLAGS="-shared -Wl,-soname -Wl,"
PLATFORM_SHARED_CFLAGS="-fPIC"
PLATFORM_SHARED_VERSIONED=true
PLATFORM_SSEFLAGS=
MEMCMP_FLAG=
if [ "$CXX" = "g++" ]; then
@ -77,6 +78,7 @@ case "$TARGET_OS" in @@ -77,6 +78,7 @@ case "$TARGET_OS" in
COMMON_FLAGS="$MEMCMP_FLAG -lpthread -DOS_LINUX -DCYGWIN"
PLATFORM_LDFLAGS="-lpthread"
PORT_FILE=port/port_posix.cc
PORT_SSE_FILE=port/port_posix_sse.cc
;;
Darwin)
PLATFORM=OS_MACOSX
@ -85,24 +87,28 @@ case "$TARGET_OS" in @@ -85,24 +87,28 @@ case "$TARGET_OS" in
[ -z "$INSTALL_PATH" ] && INSTALL_PATH=`pwd`
PLATFORM_SHARED_LDFLAGS="-dynamiclib -install_name $INSTALL_PATH/"
PORT_FILE=port/port_posix.cc
PORT_SSE_FILE=port/port_posix_sse.cc
;;
Linux)
PLATFORM=OS_LINUX
COMMON_FLAGS="$MEMCMP_FLAG -pthread -DOS_LINUX"
PLATFORM_LDFLAGS="-pthread"
PORT_FILE=port/port_posix.cc
PORT_SSE_FILE=port/port_posix_sse.cc
;;
SunOS)
PLATFORM=OS_SOLARIS
COMMON_FLAGS="$MEMCMP_FLAG -D_REENTRANT -DOS_SOLARIS"
PLATFORM_LIBS="-lpthread -lrt"
PORT_FILE=port/port_posix.cc
PORT_SSE_FILE=port/port_posix_sse.cc
;;
FreeBSD)
PLATFORM=OS_FREEBSD
COMMON_FLAGS="$MEMCMP_FLAG -D_REENTRANT -DOS_FREEBSD"
PLATFORM_LIBS="-lpthread"
PORT_FILE=port/port_posix.cc
PORT_SSE_FILE=port/port_posix_sse.cc
;;
GNU/kFreeBSD)
PLATFORM=OS_KFREEBSD
@ -115,24 +121,28 @@ case "$TARGET_OS" in @@ -115,24 +121,28 @@ case "$TARGET_OS" in
COMMON_FLAGS="$MEMCMP_FLAG -D_REENTRANT -DOS_NETBSD"
PLATFORM_LIBS="-lpthread -lgcc_s"
PORT_FILE=port/port_posix.cc
PORT_SSE_FILE=port/port_posix_sse.cc
;;
OpenBSD)
PLATFORM=OS_OPENBSD
COMMON_FLAGS="$MEMCMP_FLAG -D_REENTRANT -DOS_OPENBSD"
PLATFORM_LDFLAGS="-pthread"
PORT_FILE=port/port_posix.cc
PORT_SSE_FILE=port/port_posix_sse.cc
;;
DragonFly)
PLATFORM=OS_DRAGONFLYBSD
COMMON_FLAGS="$MEMCMP_FLAG -D_REENTRANT -DOS_DRAGONFLYBSD"
PLATFORM_LIBS="-lpthread"
PORT_FILE=port/port_posix.cc
PORT_SSE_FILE=port/port_posix_sse.cc
;;
OS_ANDROID_CROSSCOMPILE)
PLATFORM=OS_ANDROID
COMMON_FLAGS="$MEMCMP_FLAG -D_REENTRANT -DOS_ANDROID -DLEVELDB_PLATFORM_POSIX"
PLATFORM_LDFLAGS="" # All pthread features are in the Android C library
PORT_FILE=port/port_posix.cc
PORT_SSE_FILE=port/port_posix_sse.cc
CROSS_COMPILE=true
;;
HP-UX)
@ -140,6 +150,7 @@ case "$TARGET_OS" in @@ -140,6 +150,7 @@ case "$TARGET_OS" in
COMMON_FLAGS="$MEMCMP_FLAG -D_REENTRANT -DOS_HPUX"
PLATFORM_LDFLAGS="-pthread"
PORT_FILE=port/port_posix.cc
PORT_SSE_FILE=port/port_posix_sse.cc
# man ld: +h internal_name
PLATFORM_SHARED_LDFLAGS="-shared -Wl,+h -Wl,"
;;
@ -148,6 +159,7 @@ case "$TARGET_OS" in @@ -148,6 +159,7 @@ case "$TARGET_OS" in
COMMON_FLAGS="$MEMCMP_FLAG -DOS_MACOSX"
[ -z "$INSTALL_PATH" ] && INSTALL_PATH=`pwd`
PORT_FILE=port/port_posix.cc
PORT_SSE_FILE=port/port_posix_sse.cc
PLATFORM_SHARED_EXT=
PLATFORM_SHARED_LDFLAGS=
PLATFORM_SHARED_CFLAGS=
@ -182,7 +194,7 @@ set +f # re-enable globbing @@ -182,7 +194,7 @@ set +f # re-enable globbing
# The sources consist of the portable files, plus the platform-specific port
# file.
echo "SOURCES=$PORTABLE_FILES $PORT_FILE" >> $OUTPUT
echo "SOURCES=$PORTABLE_FILES $PORT_FILE $PORT_SSE_FILE" >> $OUTPUT
echo "MEMENV_SOURCES=helpers/memenv/memenv.cc" >> $OUTPUT
if [ "$CROSS_COMPILE" = "true" ]; then
@ -213,6 +225,21 @@ EOF @@ -213,6 +225,21 @@ EOF
fi
rm -f $CXXOUTPUT 2>/dev/null
# Test if gcc SSE 4.2 is supported
$CXX $CXXFLAGS -x c++ - -o $CXXOUTPUT -msse4.2 2>/dev/null <<EOF
int main() {}
EOF
if [ "$?" = 0 ]; then
PLATFORM_SSEFLAGS="-msse4.2"
fi
rm -f $CXXOUTPUT 2>/dev/null
fi
# Use the SSE 4.2 CRC32C intrinsics iff runtime checks indicate compiler supports them.
if [ -n "$PLATFORM_SSEFLAGS" ]; then
PLATFORM_SSEFLAGS="$PLATFORM_SSEFLAGS -DLEVELDB_PLATFORM_POSIX_SSE"
fi
PLATFORM_CCFLAGS="$PLATFORM_CCFLAGS $COMMON_FLAGS"
@ -225,6 +252,7 @@ echo "PLATFORM_LDFLAGS=$PLATFORM_LDFLAGS" >> $OUTPUT @@ -225,6 +252,7 @@ echo "PLATFORM_LDFLAGS=$PLATFORM_LDFLAGS" >> $OUTPUT
echo "PLATFORM_LIBS=$PLATFORM_LIBS" >> $OUTPUT
echo "PLATFORM_CCFLAGS=$PLATFORM_CCFLAGS" >> $OUTPUT
echo "PLATFORM_CXXFLAGS=$PLATFORM_CXXFLAGS" >> $OUTPUT
echo "PLATFORM_SSEFLAGS=$PLATFORM_SSEFLAGS" >> $OUTPUT
echo "PLATFORM_SHARED_CFLAGS=$PLATFORM_SHARED_CFLAGS" >> $OUTPUT
echo "PLATFORM_SHARED_EXT=$PLATFORM_SHARED_EXT" >> $OUTPUT
echo "PLATFORM_SHARED_LDFLAGS=$PLATFORM_SHARED_LDFLAGS" >> $OUTPUT

38
db/db_bench.cc

@ -84,6 +84,14 @@ static bool FLAGS_histogram = false; @@ -84,6 +84,14 @@ static bool FLAGS_histogram = false;
// (initialized to default value by "main")
static int FLAGS_write_buffer_size = 0;
// Number of bytes written to each file.
// (initialized to default value by "main")
static int FLAGS_max_file_size = 0;
// Approximate size of user data packed per block (before compression.
// (initialized to default value by "main")
static int FLAGS_block_size = 0;
// Number of bytes to use as a cache of uncompressed data.
// Negative means use default settings.
static int FLAGS_cache_size = -1;
@ -109,6 +117,7 @@ static const char* FLAGS_db = NULL; @@ -109,6 +117,7 @@ static const char* FLAGS_db = NULL;
namespace leveldb {
namespace {
leveldb::Env* g_env = NULL;
// Helper for quickly generating random data.
class RandomGenerator {
@ -186,7 +195,7 @@ class Stats { @@ -186,7 +195,7 @@ class Stats {
done_ = 0;
bytes_ = 0;
seconds_ = 0;
start_ = Env::Default()->NowMicros();
start_ = g_env->NowMicros();
finish_ = start_;
message_.clear();
}
@ -204,7 +213,7 @@ class Stats { @@ -204,7 +213,7 @@ class Stats {
}
void Stop() {
finish_ = Env::Default()->NowMicros();
finish_ = g_env->NowMicros();
seconds_ = (finish_ - start_) * 1e-6;
}
@ -214,7 +223,7 @@ class Stats { @@ -214,7 +223,7 @@ class Stats {
void FinishedSingleOp() {
if (FLAGS_histogram) {
double now = Env::Default()->NowMicros();
double now = g_env->NowMicros();
double micros = now - last_op_finish_;
hist_.Add(micros);
if (micros > 20000) {
@ -404,10 +413,10 @@ class Benchmark { @@ -404,10 +413,10 @@ class Benchmark {
reads_(FLAGS_reads < 0 ? FLAGS_num : FLAGS_reads),
heap_counter_(0) {
std::vector<std::string> files;
Env::Default()->GetChildren(FLAGS_db, &files);
g_env->GetChildren(FLAGS_db, &files);
for (size_t i = 0; i < files.size(); i++) {
if (Slice(files[i]).starts_with("heap-")) {
Env::Default()->DeleteFile(std::string(FLAGS_db) + "/" + files[i]);
g_env->DeleteFile(std::string(FLAGS_db) + "/" + files[i]);
}
}
if (!FLAGS_use_existing_db) {
@ -589,7 +598,7 @@ class Benchmark { @@ -589,7 +598,7 @@ class Benchmark {
arg[i].shared = &shared;
arg[i].thread = new ThreadState(i);
arg[i].thread->shared = &shared;
Env::Default()->StartThread(ThreadBody, &arg[i]);
g_env->StartThread(ThreadBody, &arg[i]);
}
shared.mu.Lock();
@ -700,9 +709,12 @@ class Benchmark { @@ -700,9 +709,12 @@ class Benchmark {
void Open() {
assert(db_ == NULL);
Options options;
options.env = g_env;
options.create_if_missing = !FLAGS_use_existing_db;
options.block_cache = cache_;
options.write_buffer_size = FLAGS_write_buffer_size;
options.max_file_size = FLAGS_max_file_size;
options.block_size = FLAGS_block_size;
options.max_open_files = FLAGS_open_files;
options.filter_policy = filter_policy_;
options.reuse_logs = FLAGS_reuse_logs;
@ -925,7 +937,7 @@ class Benchmark { @@ -925,7 +937,7 @@ class Benchmark {
char fname[100];
snprintf(fname, sizeof(fname), "%s/heap-%04d", FLAGS_db, ++heap_counter_);
WritableFile* file;
Status s = Env::Default()->NewWritableFile(fname, &file);
Status s = g_env->NewWritableFile(fname, &file);
if (!s.ok()) {
fprintf(stderr, "%s\n", s.ToString().c_str());
return;
@ -934,7 +946,7 @@ class Benchmark { @@ -934,7 +946,7 @@ class Benchmark {
delete file;
if (!ok) {
fprintf(stderr, "heap profiling not supported\n");
Env::Default()->DeleteFile(fname);
g_env->DeleteFile(fname);
}
}
};
@ -943,6 +955,8 @@ class Benchmark { @@ -943,6 +955,8 @@ class Benchmark {
int main(int argc, char** argv) {
FLAGS_write_buffer_size = leveldb::Options().write_buffer_size;
FLAGS_max_file_size = leveldb::Options().max_file_size;
FLAGS_block_size = leveldb::Options().block_size;
FLAGS_open_files = leveldb::Options().max_open_files;
std::string default_db_path;
@ -973,6 +987,10 @@ int main(int argc, char** argv) { @@ -973,6 +987,10 @@ int main(int argc, char** argv) {
FLAGS_value_size = n;
} else if (sscanf(argv[i], "--write_buffer_size=%d%c", &n, &junk) == 1) {
FLAGS_write_buffer_size = n;
} else if (sscanf(argv[i], "--max_file_size=%d%c", &n, &junk) == 1) {
FLAGS_max_file_size = n;
} else if (sscanf(argv[i], "--block_size=%d%c", &n, &junk) == 1) {
FLAGS_block_size = n;
} else if (sscanf(argv[i], "--cache_size=%d%c", &n, &junk) == 1) {
FLAGS_cache_size = n;
} else if (sscanf(argv[i], "--bloom_bits=%d%c", &n, &junk) == 1) {
@ -987,9 +1005,11 @@ int main(int argc, char** argv) { @@ -987,9 +1005,11 @@ int main(int argc, char** argv) {
}
}
leveldb::g_env = leveldb::Env::Default();
// Choose a location for the test database if none given with --db=<path>
if (FLAGS_db == NULL) {
leveldb::Env::Default()->GetTestDirectory(&default_db_path);
leveldb::g_env->GetTestDirectory(&default_db_path);
default_db_path += "/dbbench";
FLAGS_db = default_db_path.c_str();
}

1
db/db_impl.cc

@ -96,6 +96,7 @@ Options SanitizeOptions(const std::string& dbname, @@ -96,6 +96,7 @@ Options SanitizeOptions(const std::string& dbname,
result.filter_policy = (src.filter_policy != NULL) ? ipolicy : NULL;
ClipToRange(&result.max_open_files, 64 + kNumNonTableCacheFiles, 50000);
ClipToRange(&result.write_buffer_size, 64<<10, 1<<30);
ClipToRange(&result.max_file_size, 1<<20, 1<<30);
ClipToRange(&result.block_size, 1<<10, 4<<20);
if (result.info_log == NULL) {
// Open a log file in the same directory as the db

2
db/log_format.h

@ -3,7 +3,7 @@ @@ -3,7 +3,7 @@
// found in the LICENSE file. See the AUTHORS file for names of contributors.
//
// Log format information shared by reader and writer.
// See ../doc/log_format.txt for more detail.
// See ../doc/log_format.md for more detail.
#ifndef STORAGE_LEVELDB_DB_LOG_FORMAT_H_
#define STORAGE_LEVELDB_DB_LOG_FORMAT_H_

57
db/version_set.cc

@ -20,21 +20,29 @@ @@ -20,21 +20,29 @@
namespace leveldb {
static const int kTargetFileSize = 2 * 1048576;
static int TargetFileSize(const Options* options) {
return options->max_file_size;
}
// Maximum bytes of overlaps in grandparent (i.e., level+2) before we
// stop building a single file in a level->level+1 compaction.
static const int64_t kMaxGrandParentOverlapBytes = 10 * kTargetFileSize;
static int64_t MaxGrandParentOverlapBytes(const Options* options) {
return 10 * TargetFileSize(options);
}
// Maximum number of bytes in all compacted files. We avoid expanding
// the lower level file set of a compaction if it would make the
// total compaction cover more than this many bytes.
static const int64_t kExpandedCompactionByteSizeLimit = 25 * kTargetFileSize;
static int64_t ExpandedCompactionByteSizeLimit(const Options* options) {
return 25 * TargetFileSize(options);
}
static double MaxBytesForLevel(int level) {
static double MaxBytesForLevel(const Options* options, int level) {
// Note: the result for level zero is not really used since we set
// the level-0 compaction threshold based on number of files.
double result = 10 * 1048576.0; // Result for both level-0 and level-1
// Result for both level-0 and level-1
double result = 10. * 1048576.0;
while (level > 1) {
result *= 10;
level--;
@ -42,8 +50,9 @@ static double MaxBytesForLevel(int level) { @@ -42,8 +50,9 @@ static double MaxBytesForLevel(int level) {
return result;
}
static uint64_t MaxFileSizeForLevel(int level) {
return kTargetFileSize; // We could vary per level to reduce number of files?
static uint64_t MaxFileSizeForLevel(const Options* options, int level) {
// We could vary per level to reduce number of files?
return TargetFileSize(options);
}
static int64_t TotalFileSize(const std::vector<FileMetaData*>& files) {
@ -508,7 +517,7 @@ int Version::PickLevelForMemTableOutput( @@ -508,7 +517,7 @@ int Version::PickLevelForMemTableOutput(
// Check that file does not overlap too many grandparent bytes.
GetOverlappingInputs(level + 2, &start, &limit, &overlaps);
const int64_t sum = TotalFileSize(overlaps);
if (sum > kMaxGrandParentOverlapBytes) {
if (sum > MaxGrandParentOverlapBytes(vset_->options_)) {
break;
}
}
@ -1027,7 +1036,7 @@ bool VersionSet::ReuseManifest(const std::string& dscname, @@ -1027,7 +1036,7 @@ bool VersionSet::ReuseManifest(const std::string& dscname,
manifest_type != kDescriptorFile ||
!env_->GetFileSize(dscname, &manifest_size).ok() ||
// Make new compacted MANIFEST if old one is too big
manifest_size >= kTargetFileSize) {
manifest_size >= TargetFileSize(options_)) {
return false;
}
@ -1076,7 +1085,8 @@ void VersionSet::Finalize(Version* v) { @@ -1076,7 +1085,8 @@ void VersionSet::Finalize(Version* v) {
} else {
// Compute the ratio of current size to size limit.
const uint64_t level_bytes = TotalFileSize(v->files_[level]);
score = static_cast<double>(level_bytes) / MaxBytesForLevel(level);
score =
static_cast<double>(level_bytes) / MaxBytesForLevel(options_, level);
}
if (score > best_score) {
@ -1290,7 +1300,7 @@ Compaction* VersionSet::PickCompaction() { @@ -1290,7 +1300,7 @@ Compaction* VersionSet::PickCompaction() {
level = current_->compaction_level_;
assert(level >= 0);
assert(level+1 < config::kNumLevels);
c = new Compaction(level);
c = new Compaction(options_, level);
// Pick the first file that comes after compact_pointer_[level]
for (size_t i = 0; i < current_->files_[level].size(); i++) {
@ -1307,7 +1317,7 @@ Compaction* VersionSet::PickCompaction() { @@ -1307,7 +1317,7 @@ Compaction* VersionSet::PickCompaction() {
}
} else if (seek_compaction) {
level = current_->file_to_compact_level_;
c = new Compaction(level);
c = new Compaction(options_, level);
c->inputs_[0].push_back(current_->file_to_compact_);
} else {
return NULL;
@ -1352,7 +1362,8 @@ void VersionSet::SetupOtherInputs(Compaction* c) { @@ -1352,7 +1362,8 @@ void VersionSet::SetupOtherInputs(Compaction* c) {
const int64_t inputs1_size = TotalFileSize(c->inputs_[1]);
const int64_t expanded0_size = TotalFileSize(expanded0);
if (expanded0.size() > c->inputs_[0].size() &&
inputs1_size + expanded0_size < kExpandedCompactionByteSizeLimit) {
inputs1_size + expanded0_size <
ExpandedCompactionByteSizeLimit(options_)) {
InternalKey new_start, new_limit;
GetRange(expanded0, &new_start, &new_limit);
std::vector<FileMetaData*> expanded1;
@ -1414,7 +1425,7 @@ Compaction* VersionSet::CompactRange( @@ -1414,7 +1425,7 @@ Compaction* VersionSet::CompactRange(
// and we must not pick one file and drop another older file if the
// two files overlap.
if (level > 0) {
const uint64_t limit = MaxFileSizeForLevel(level);
const uint64_t limit = MaxFileSizeForLevel(options_, level);
uint64_t total = 0;
for (size_t i = 0; i < inputs.size(); i++) {
uint64_t s = inputs[i]->file_size;
@ -1426,7 +1437,7 @@ Compaction* VersionSet::CompactRange( @@ -1426,7 +1437,7 @@ Compaction* VersionSet::CompactRange(
}
}
Compaction* c = new Compaction(level);
Compaction* c = new Compaction(options_, level);
c->input_version_ = current_;
c->input_version_->Ref();
c->inputs_[0] = inputs;
@ -1434,9 +1445,9 @@ Compaction* VersionSet::CompactRange( @@ -1434,9 +1445,9 @@ Compaction* VersionSet::CompactRange(
return c;
}
Compaction::Compaction(int level)
Compaction::Compaction(const Options* options, int level)
: level_(level),
max_output_file_size_(MaxFileSizeForLevel(level)),
max_output_file_size_(MaxFileSizeForLevel(options, level)),
input_version_(NULL),
grandparent_index_(0),
seen_key_(false),
@ -1453,12 +1464,13 @@ Compaction::~Compaction() { @@ -1453,12 +1464,13 @@ Compaction::~Compaction() {
}
bool Compaction::IsTrivialMove() const {
const VersionSet* vset = input_version_->vset_;
// Avoid a move if there is lots of overlapping grandparent data.
// Otherwise, the move could create a parent file that will require
// a very expensive merge later on.
return (num_input_files(0) == 1 &&
num_input_files(1) == 0 &&
TotalFileSize(grandparents_) <= kMaxGrandParentOverlapBytes);
return (num_input_files(0) == 1 && num_input_files(1) == 0 &&
TotalFileSize(grandparents_) <=
MaxGrandParentOverlapBytes(vset->options_));
}
void Compaction::AddInputDeletions(VersionEdit* edit) {
@ -1491,8 +1503,9 @@ bool Compaction::IsBaseLevelForKey(const Slice& user_key) { @@ -1491,8 +1503,9 @@ bool Compaction::IsBaseLevelForKey(const Slice& user_key) {
}
bool Compaction::ShouldStopBefore(const Slice& internal_key) {
const VersionSet* vset = input_version_->vset_;
// Scan to find earliest grandparent file that contains key.
const InternalKeyComparator* icmp = &input_version_->vset_->icmp_;
const InternalKeyComparator* icmp = &vset->icmp_;
while (grandparent_index_ < grandparents_.size() &&
icmp->Compare(internal_key,
grandparents_[grandparent_index_]->largest.Encode()) > 0) {
@ -1503,7 +1516,7 @@ bool Compaction::ShouldStopBefore(const Slice& internal_key) { @@ -1503,7 +1516,7 @@ bool Compaction::ShouldStopBefore(const Slice& internal_key) {
}
seen_key_ = true;
if (overlapped_bytes_ > kMaxGrandParentOverlapBytes) {
if (overlapped_bytes_ > MaxGrandParentOverlapBytes(vset->options_)) {
// Too much overlap for current output; start new output
overlapped_bytes_ = 0;
return true;

2
db/version_set.h

@ -366,7 +366,7 @@ class Compaction { @@ -366,7 +366,7 @@ class Compaction {
friend class Version;
friend class VersionSet;
explicit Compaction(int level);
Compaction(const Options* options, int level);
int level_;
uint64_t max_output_file_size_;

89
doc/doc.css

@ -1,89 +0,0 @@ @@ -1,89 +0,0 @@
body {
margin-left: 0.5in;
margin-right: 0.5in;
background: white;
color: black;
}
h1 {
margin-left: -0.2in;
font-size: 14pt;
}
h2 {
margin-left: -0in;
font-size: 12pt;
}
h3 {
margin-left: -0in;
}
h4 {
margin-left: -0in;
}
hr {
margin-left: -0in;
}
/* Definition lists: definition term bold */
dt {
font-weight: bold;
}
address {
text-align: center;
}
code,samp,var {
color: blue;
}
kbd {
color: #600000;
}
div.note p {
float: right;
width: 3in;
margin-right: 0%;
padding: 1px;
border: 2px solid #6060a0;
background-color: #fffff0;
}
ul {
margin-top: -0em;
margin-bottom: -0em;
}
ol {
margin-top: -0em;
margin-bottom: -0em;
}
UL.nobullets {
list-style-type: none;
list-style-image: none;
margin-left: -1em;
}
p {
margin: 1em 0 1em 0;
padding: 0 0 0 0;
}
pre {
line-height: 1.3em;
padding: 0.4em 0 0.8em 0;
margin: 0 0 0 0;
border: 0 0 0 0;
color: blue;
}
.datatable {
margin-left: auto;
margin-right: auto;
margin-top: 2em;
margin-bottom: 2em;
border: 1px solid;
}
.datatable td,th {
padding: 0 0.5em 0 0.5em;
text-align: right;
}

213
doc/impl.html

@ -1,213 +0,0 @@ @@ -1,213 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" href="doc.css" />
<title>Leveldb file layout and compactions</title>
</head>
<body>
<h1>Files</h1>
The implementation of leveldb is similar in spirit to the
representation of a single
<a href="http://research.google.com/archive/bigtable.html">
Bigtable tablet (section 5.3)</a>.
However the organization of the files that make up the representation
is somewhat different and is explained below.
<p>
Each database is represented by a set of files stored in a directory.
There are several different types of files as documented below:
<p>
<h2>Log files</h2>
<p>
A log file (*.log) stores a sequence of recent updates. Each update
is appended to the current log file. When the log file reaches a
pre-determined size (approximately 4MB by default), it is converted
to a sorted table (see below) and a new log file is created for future
updates.
<p>
A copy of the current log file is kept in an in-memory structure (the
<code>memtable</code>). This copy is consulted on every read so that read
operations reflect all logged updates.
<p>
<h2>Sorted tables</h2>
<p>
A sorted table (*.sst) stores a sequence of entries sorted by key.
Each entry is either a value for the key, or a deletion marker for the
key. (Deletion markers are kept around to hide obsolete values
present in older sorted tables).
<p>
The set of sorted tables are organized into a sequence of levels. The
sorted table generated from a log file is placed in a special <code>young</code>
level (also called level-0). When the number of young files exceeds a
certain threshold (currently four), all of the young files are merged
together with all of the overlapping level-1 files to produce a
sequence of new level-1 files (we create a new level-1 file for every
2MB of data.)
<p>
Files in the young level may contain overlapping keys. However files
in other levels have distinct non-overlapping key ranges. Consider
level number L where L >= 1. When the combined size of files in
level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2,
...), one file in level-L, and all of the overlapping files in
level-(L+1) are merged to form a set of new files for level-(L+1).
These merges have the effect of gradually migrating new updates from
the young level to the largest level using only bulk reads and writes
(i.e., minimizing expensive seeks).
<h2>Manifest</h2>
<p>
A MANIFEST file lists the set of sorted tables that make up each
level, the corresponding key ranges, and other important metadata.
A new MANIFEST file (with a new number embedded in the file name)
is created whenever the database is reopened. The MANIFEST file is
formatted as a log, and changes made to the serving state (as files
are added or removed) are appended to this log.
<p>
<h2>Current</h2>
<p>
CURRENT is a simple text file that contains the name of the latest
MANIFEST file.
<p>
<h2>Info logs</h2>
<p>
Informational messages are printed to files named LOG and LOG.old.
<p>
<h2>Others</h2>
<p>
Other files used for miscellaneous purposes may also be present
(LOCK, *.dbtmp).
<h1>Level 0</h1>
When the log file grows above a certain size (1MB by default):
<ul>
<li>Create a brand new memtable and log file and direct future updates here
<li>In the background:
<ul>
<li>Write the contents of the previous memtable to an sstable
<li>Discard the memtable
<li>Delete the old log file and the old memtable
<li>Add the new sstable to the young (level-0) level.
</ul>
</ul>
<h1>Compactions</h1>
<p>
When the size of level L exceeds its limit, we compact it in a
background thread. The compaction picks a file from level L and all
overlapping files from the next level L+1. Note that if a level-L
file overlaps only part of a level-(L+1) file, the entire file at
level-(L+1) is used as an input to the compaction and will be
discarded after the compaction. Aside: because level-0 is special
(files in it may overlap each other), we treat compactions from
level-0 to level-1 specially: a level-0 compaction may pick more than
one level-0 file in case some of these files overlap each other.
<p>
A compaction merges the contents of the picked files to produce a
sequence of level-(L+1) files. We switch to producing a new
level-(L+1) file after the current output file has reached the target
file size (2MB). We also switch to a new output file when the key
range of the current output file has grown enough to overlap more than
ten level-(L+2) files. This last rule ensures that a later compaction
of a level-(L+1) file will not pick up too much data from level-(L+2).
<p>
The old files are discarded and the new files are added to the serving
state.
<p>
Compactions for a particular level rotate through the key space. In
more detail, for each level L, we remember the ending key of the last
compaction at level L. The next compaction for level L will pick the
first file that starts after this key (wrapping around to the
beginning of the key space if there is no such file).
<p>
Compactions drop overwritten values. They also drop deletion markers
if there are no higher numbered levels that contain a file whose range
overlaps the current key.
<h2>Timing</h2>
Level-0 compactions will read up to four 1MB files from level-0, and
at worst all the level-1 files (10MB). I.e., we will read 14MB and
write 14MB.
<p>
Other than the special level-0 compactions, we will pick one 2MB file
from level L. In the worst case, this will overlap ~ 12 files from
level L+1 (10 because level-(L+1) is ten times the size of level-L,
and another two at the boundaries since the file ranges at level-L
will usually not be aligned with the file ranges at level-L+1). The
compaction will therefore read 26MB and write 26MB. Assuming a disk
IO rate of 100MB/s (ballpark range for modern drives), the worst
compaction cost will be approximately 0.5 second.
<p>
If we throttle the background writing to something small, say 10% of
the full 100MB/s speed, a compaction may take up to 5 seconds. If the
user is writing at 10MB/s, we might build up lots of level-0 files
(~50 to hold the 5*10MB). This may significantly increase the cost of
reads due to the overhead of merging more files together on every
read.
<p>
Solution 1: To reduce this problem, we might want to increase the log
switching threshold when the number of level-0 files is large. Though
the downside is that the larger this threshold, the more memory we will
need to hold the corresponding memtable.
<p>
Solution 2: We might want to decrease write rate artificially when the
number of level-0 files goes up.
<p>
Solution 3: We work on reducing the cost of very wide merges.
Perhaps most of the level-0 files will have their blocks sitting
uncompressed in the cache and we will only need to worry about the
O(N) complexity in the merging iterator.
<h2>Number of files</h2>
Instead of always making 2MB files, we could make larger files for
larger levels to reduce the total file count, though at the expense of
more bursty compactions. Alternatively, we could shard the set of
files into multiple directories.
<p>
An experiment on an <code>ext3</code> filesystem on Feb 04, 2011 shows
the following timings to do 100K file opens in directories with
varying number of files:
<table class="datatable">
<tr><th>Files in directory</th><th>Microseconds to open a file</th></tr>
<tr><td>1000</td><td>9</td>
<tr><td>10000</td><td>10</td>
<tr><td>100000</td><td>16</td>
</table>
So maybe even the sharding is not necessary on modern filesystems?
<h1>Recovery</h1>
<ul>
<li> Read CURRENT to find name of the latest committed MANIFEST
<li> Read the named MANIFEST file
<li> Clean up stale files
<li> We could open all sstables here, but it is probably better to be lazy...
<li> Convert log chunk to a new level-0 sstable
<li> Start directing new writes to a new log file with recovered sequence#
</ul>
<h1>Garbage collection of files</h1>
<code>DeleteObsoleteFiles()</code> is called at the end of every
compaction and at the end of recovery. It finds the names of all
files in the database. It deletes all log files that are not the
current log file. It deletes all table files that are not referenced
from some level and are not the output of an active compaction.
</body>
</html>

170
doc/impl.md

@ -0,0 +1,170 @@ @@ -0,0 +1,170 @@
## Files
The implementation of leveldb is similar in spirit to the representation of a
single [Bigtable tablet (section 5.3)](http://research.google.com/archive/bigtable.html).
However the organization of the files that make up the representation is
somewhat different and is explained below.
Each database is represented by a set of files stored in a directory. There are
several different types of files as documented below:
### Log files
A log file (*.log) stores a sequence of recent updates. Each update is appended
to the current log file. When the log file reaches a pre-determined size
(approximately 4MB by default), it is converted to a sorted table (see below)
and a new log file is created for future updates.
A copy of the current log file is kept in an in-memory structure (the
`memtable`). This copy is consulted on every read so that read operations
reflect all logged updates.
## Sorted tables
A sorted table (*.ldb) stores a sequence of entries sorted by key. Each entry is
either a value for the key, or a deletion marker for the key. (Deletion markers
are kept around to hide obsolete values present in older sorted tables).
The set of sorted tables are organized into a sequence of levels. The sorted
table generated from a log file is placed in a special **young** level (also
called level-0). When the number of young files exceeds a certain threshold
(currently four), all of the young files are merged together with all of the
overlapping level-1 files to produce a sequence of new level-1 files (we create
a new level-1 file for every 2MB of data.)
Files in the young level may contain overlapping keys. However files in other
levels have distinct non-overlapping key ranges. Consider level number L where
L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB
for level-1, 100MB for level-2, ...), one file in level-L, and all of the
overlapping files in level-(L+1) are merged to form a set of new files for
level-(L+1). These merges have the effect of gradually migrating new updates
from the young level to the largest level using only bulk reads and writes
(i.e., minimizing expensive seeks).
### Manifest
A MANIFEST file lists the set of sorted tables that make up each level, the
corresponding key ranges, and other important metadata. A new MANIFEST file
(with a new number embedded in the file name) is created whenever the database
is reopened. The MANIFEST file is formatted as a log, and changes made to the
serving state (as files are added or removed) are appended to this log.
### Current
CURRENT is a simple text file that contains the name of the latest MANIFEST
file.
### Info logs
Informational messages are printed to files named LOG and LOG.old.
### Others
Other files used for miscellaneous purposes may also be present (LOCK, *.dbtmp).
## Level 0
When the log file grows above a certain size (1MB by default):
Create a brand new memtable and log file and direct future updates here
In the background:
Write the contents of the previous memtable to an sstable
Discard the memtable
Delete the old log file and the old memtable
Add the new sstable to the young (level-0) level.
## Compactions
When the size of level L exceeds its limit, we compact it in a background
thread. The compaction picks a file from level L and all overlapping files from
the next level L+1. Note that if a level-L file overlaps only part of a
level-(L+1) file, the entire file at level-(L+1) is used as an input to the
compaction and will be discarded after the compaction. Aside: because level-0
is special (files in it may overlap each other), we treat compactions from
level-0 to level-1 specially: a level-0 compaction may pick more than one
level-0 file in case some of these files overlap each other.
A compaction merges the contents of the picked files to produce a sequence of
level-(L+1) files. We switch to producing a new level-(L+1) file after the
current output file has reached the target file size (2MB). We also switch to a
new output file when the key range of the current output file has grown enough
to overlap more than ten level-(L+2) files. This last rule ensures that a later
compaction of a level-(L+1) file will not pick up too much data from
level-(L+2).
The old files are discarded and the new files are added to the serving state.
Compactions for a particular level rotate through the key space. In more detail,
for each level L, we remember the ending key of the last compaction at level L.
The next compaction for level L will pick the first file that starts after this
key (wrapping around to the beginning of the key space if there is no such
file).
Compactions drop overwritten values. They also drop deletion markers if there
are no higher numbered levels that contain a file whose range overlaps the
current key.
### Timing
Level-0 compactions will read up to four 1MB files from level-0, and at worst
all the level-1 files (10MB). I.e., we will read 14MB and write 14MB.
Other than the special level-0 compactions, we will pick one 2MB file from level
L. In the worst case, this will overlap ~ 12 files from level L+1 (10 because
level-(L+1) is ten times the size of level-L, and another two at the boundaries
since the file ranges at level-L will usually not be aligned with the file
ranges at level-L+1). The compaction will therefore read 26MB and write 26MB.
Assuming a disk IO rate of 100MB/s (ballpark range for modern drives), the worst
compaction cost will be approximately 0.5 second.
If we throttle the background writing to something small, say 10% of the full
100MB/s speed, a compaction may take up to 5 seconds. If the user is writing at
10MB/s, we might build up lots of level-0 files (~50 to hold the 5*10MB). This
may significantly increase the cost of reads due to the overhead of merging more
files together on every read.
Solution 1: To reduce this problem, we might want to increase the log switching
threshold when the number of level-0 files is large. Though the downside is that
the larger this threshold, the more memory we will need to hold the
corresponding memtable.
Solution 2: We might want to decrease write rate artificially when the number of
level-0 files goes up.
Solution 3: We work on reducing the cost of very wide merges. Perhaps most of
the level-0 files will have their blocks sitting uncompressed in the cache and
we will only need to worry about the O(N) complexity in the merging iterator.
### Number of files
Instead of always making 2MB files, we could make larger files for larger levels
to reduce the total file count, though at the expense of more bursty
compactions. Alternatively, we could shard the set of files into multiple
directories.
An experiment on an ext3 filesystem on Feb 04, 2011 shows the following timings
to do 100K file opens in directories with varying number of files:
| Files in directory | Microseconds to open a file |
|-------------------:|----------------------------:|
| 1000 | 9 |
| 10000 | 10 |
| 100000 | 16 |
So maybe even the sharding is not necessary on modern filesystems?
## Recovery
* Read CURRENT to find name of the latest committed MANIFEST
* Read the named MANIFEST file
* Clean up stale files
* We could open all sstables here, but it is probably better to be lazy...
* Convert log chunk to a new level-0 sstable
* Start directing new writes to a new log file with recovered sequence#
## Garbage collection of files
`DeleteObsoleteFiles()` is called at the end of every compaction and at the end
of recovery. It finds the names of all files in the database. It deletes all log
files that are not the current log file. It deletes all table files that are not
referenced from some level and are not the output of an active compaction.

549
doc/index.html

@ -1,549 +0,0 @@ @@ -1,549 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" href="doc.css" />
<title>Leveldb</title>
</head>
<body>
<h1>Leveldb</h1>
<address>Jeff Dean, Sanjay Ghemawat</address>
<p>
The <code>leveldb</code> library provides a persistent key value store. Keys and
values are arbitrary byte arrays. The keys are ordered within the key
value store according to a user-specified comparator function.
<p>
<h1>Opening A Database</h1>
<p>
A <code>leveldb</code> database has a name which corresponds to a file system
directory. All of the contents of database are stored in this
directory. The following example shows how to open a database,
creating it if necessary:
<p>
<pre>
#include &lt;cassert&gt;
#include "leveldb/db.h"
leveldb::DB* db;
leveldb::Options options;
options.create_if_missing = true;
leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &amp;db);
assert(status.ok());
...
</pre>
If you want to raise an error if the database already exists, add
the following line before the <code>leveldb::DB::Open</code> call:
<pre>
options.error_if_exists = true;
</pre>
<h1>Status</h1>
<p>
You may have noticed the <code>leveldb::Status</code> type above. Values of this
type are returned by most functions in <code>leveldb</code> that may encounter an
error. You can check if such a result is ok, and also print an
associated error message:
<p>
<pre>
leveldb::Status s = ...;
if (!s.ok()) cerr &lt;&lt; s.ToString() &lt;&lt; endl;
</pre>
<h1>Closing A Database</h1>
<p>
When you are done with a database, just delete the database object.
Example:
<p>
<pre>
... open the db as described above ...
... do something with db ...
delete db;
</pre>
<h1>Reads And Writes</h1>
<p>
The database provides <code>Put</code>, <code>Delete</code>, and <code>Get</code> methods to
modify/query the database. For example, the following code
moves the value stored under key1 to key2.
<pre>
std::string value;
leveldb::Status s = db-&gt;Get(leveldb::ReadOptions(), key1, &amp;value);
if (s.ok()) s = db-&gt;Put(leveldb::WriteOptions(), key2, value);
if (s.ok()) s = db-&gt;Delete(leveldb::WriteOptions(), key1);
</pre>
<h1>Atomic Updates</h1>
<p>
Note that if the process dies after the Put of key2 but before the
delete of key1, the same value may be left stored under multiple keys.
Such problems can be avoided by using the <code>WriteBatch</code> class to
atomically apply a set of updates:
<p>
<pre>
#include "leveldb/write_batch.h"
...
std::string value;
leveldb::Status s = db-&gt;Get(leveldb::ReadOptions(), key1, &amp;value);
if (s.ok()) {
leveldb::WriteBatch batch;
batch.Delete(key1);
batch.Put(key2, value);
s = db-&gt;Write(leveldb::WriteOptions(), &amp;batch);
}
</pre>
The <code>WriteBatch</code> holds a sequence of edits to be made to the database,
and these edits within the batch are applied in order. Note that we
called <code>Delete</code> before <code>Put</code> so that if <code>key1</code> is identical to <code>key2</code>,
we do not end up erroneously dropping the value entirely.
<p>
Apart from its atomicity benefits, <code>WriteBatch</code> may also be used to
speed up bulk updates by placing lots of individual mutations into the
same batch.
<h1>Synchronous Writes</h1>
By default, each write to <code>leveldb</code> is asynchronous: it
returns after pushing the write from the process into the operating
system. The transfer from operating system memory to the underlying
persistent storage happens asynchronously. The <code>sync</code> flag
can be turned on for a particular write to make the write operation
not return until the data being written has been pushed all the way to
persistent storage. (On Posix systems, this is implemented by calling
either <code>fsync(...)</code> or <code>fdatasync(...)</code> or
<code>msync(..., MS_SYNC)</code> before the write operation returns.)
<pre>
leveldb::WriteOptions write_options;
write_options.sync = true;
db-&gt;Put(write_options, ...);
</pre>
Asynchronous writes are often more than a thousand times as fast as
synchronous writes. The downside of asynchronous writes is that a
crash of the machine may cause the last few updates to be lost. Note
that a crash of just the writing process (i.e., not a reboot) will not
cause any loss since even when <code>sync</code> is false, an update
is pushed from the process memory into the operating system before it
is considered done.
<p>
Asynchronous writes can often be used safely. For example, when
loading a large amount of data into the database you can handle lost
updates by restarting the bulk load after a crash. A hybrid scheme is
also possible where every Nth write is synchronous, and in the event
of a crash, the bulk load is restarted just after the last synchronous
write finished by the previous run. (The synchronous write can update
a marker that describes where to restart on a crash.)
<p>
<code>WriteBatch</code> provides an alternative to asynchronous writes.
Multiple updates may be placed in the same <code>WriteBatch</code> and
applied together using a synchronous write (i.e.,
<code>write_options.sync</code> is set to true). The extra cost of
the synchronous write will be amortized across all of the writes in
the batch.
<p>
<h1>Concurrency</h1>
<p>
A database may only be opened by one process at a time.
The <code>leveldb</code> implementation acquires a lock from the
operating system to prevent misuse. Within a single process, the
same <code>leveldb::DB</code> object may be safely shared by multiple
concurrent threads. I.e., different threads may write into or fetch
iterators or call <code>Get</code> on the same database without any
external synchronization (the leveldb implementation will
automatically do the required synchronization). However other objects
(like Iterator and WriteBatch) may require external synchronization.
If two threads share such an object, they must protect access to it
using their own locking protocol. More details are available in
the public header files.
<p>
<h1>Iteration</h1>
<p>
The following example demonstrates how to print all key,value pairs
in a database.
<p>
<pre>
leveldb::Iterator* it = db-&gt;NewIterator(leveldb::ReadOptions());
for (it-&gt;SeekToFirst(); it-&gt;Valid(); it-&gt;Next()) {
cout &lt;&lt; it-&gt;key().ToString() &lt;&lt; ": " &lt;&lt; it-&gt;value().ToString() &lt;&lt; endl;
}
assert(it-&gt;status().ok()); // Check for any errors found during the scan
delete it;
</pre>
The following variation shows how to process just the keys in the
range <code>[start,limit)</code>:
<p>
<pre>
for (it-&gt;Seek(start);
it-&gt;Valid() &amp;&amp; it-&gt;key().ToString() &lt; limit;
it-&gt;Next()) {
...
}
</pre>
You can also process entries in reverse order. (Caveat: reverse
iteration may be somewhat slower than forward iteration.)
<p>
<pre>
for (it-&gt;SeekToLast(); it-&gt;Valid(); it-&gt;Prev()) {
...
}
</pre>
<h1>Snapshots</h1>
<p>
Snapshots provide consistent read-only views over the entire state of
the key-value store. <code>ReadOptions::snapshot</code> may be non-NULL to indicate
that a read should operate on a particular version of the DB state.
If <code>ReadOptions::snapshot</code> is NULL, the read will operate on an
implicit snapshot of the current state.
<p>
Snapshots are created by the DB::GetSnapshot() method:
<p>
<pre>
leveldb::ReadOptions options;
options.snapshot = db-&gt;GetSnapshot();
... apply some updates to db ...
leveldb::Iterator* iter = db-&gt;NewIterator(options);
... read using iter to view the state when the snapshot was created ...
delete iter;
db-&gt;ReleaseSnapshot(options.snapshot);
</pre>
Note that when a snapshot is no longer needed, it should be released
using the DB::ReleaseSnapshot interface. This allows the
implementation to get rid of state that was being maintained just to
support reading as of that snapshot.
<h1>Slice</h1>
<p>
The return value of the <code>it->key()</code> and <code>it->value()</code> calls above
are instances of the <code>leveldb::Slice</code> type. <code>Slice</code> is a simple
structure that contains a length and a pointer to an external byte
array. Returning a <code>Slice</code> is a cheaper alternative to returning a
<code>std::string</code> since we do not need to copy potentially large keys and
values. In addition, <code>leveldb</code> methods do not return null-terminated
C-style strings since <code>leveldb</code> keys and values are allowed to
contain '\0' bytes.
<p>
C++ strings and null-terminated C-style strings can be easily converted
to a Slice:
<p>
<pre>
leveldb::Slice s1 = "hello";
std::string str("world");
leveldb::Slice s2 = str;
</pre>
A Slice can be easily converted back to a C++ string:
<pre>
std::string str = s1.ToString();
assert(str == std::string("hello"));
</pre>
Be careful when using Slices since it is up to the caller to ensure that
the external byte array into which the Slice points remains live while
the Slice is in use. For example, the following is buggy:
<p>
<pre>
leveldb::Slice slice;
if (...) {
std::string str = ...;
slice = str;
}
Use(slice);
</pre>
When the <code>if</code> statement goes out of scope, <code>str</code> will be destroyed and the
backing storage for <code>slice</code> will disappear.
<p>
<h1>Comparators</h1>
<p>
The preceding examples used the default ordering function for key,
which orders bytes lexicographically. You can however supply a custom
comparator when opening a database. For example, suppose each
database key consists of two numbers and we should sort by the first
number, breaking ties by the second number. First, define a proper
subclass of <code>leveldb::Comparator</code> that expresses these rules:
<p>
<pre>
class TwoPartComparator : public leveldb::Comparator {
public:
// Three-way comparison function:
// if a &lt; b: negative result
// if a &gt; b: positive result
// else: zero result
int Compare(const leveldb::Slice&amp; a, const leveldb::Slice&amp; b) const {
int a1, a2, b1, b2;
ParseKey(a, &amp;a1, &amp;a2);
ParseKey(b, &amp;b1, &amp;b2);
if (a1 &lt; b1) return -1;
if (a1 &gt; b1) return +1;
if (a2 &lt; b2) return -1;
if (a2 &gt; b2) return +1;
return 0;
}
// Ignore the following methods for now:
const char* Name() const { return "TwoPartComparator"; }
void FindShortestSeparator(std::string*, const leveldb::Slice&amp;) const { }
void FindShortSuccessor(std::string*) const { }
};
</pre>
Now create a database using this custom comparator:
<p>
<pre>
TwoPartComparator cmp;
leveldb::DB* db;
leveldb::Options options;
options.create_if_missing = true;
options.comparator = &amp;cmp;
leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &amp;db);
...
</pre>
<h2>Backwards compatibility</h2>
<p>
The result of the comparator's <code>Name</code> method is attached to the
database when it is created, and is checked on every subsequent
database open. If the name changes, the <code>leveldb::DB::Open</code> call will
fail. Therefore, change the name if and only if the new key format
and comparison function are incompatible with existing databases, and
it is ok to discard the contents of all existing databases.
<p>
You can however still gradually evolve your key format over time with
a little bit of pre-planning. For example, you could store a version
number at the end of each key (one byte should suffice for most uses).
When you wish to switch to a new key format (e.g., adding an optional
third part to the keys processed by <code>TwoPartComparator</code>),
(a) keep the same comparator name (b) increment the version number
for new keys (c) change the comparator function so it uses the
version numbers found in the keys to decide how to interpret them.
<p>
<h1>Performance</h1>
<p>
Performance can be tuned by changing the default values of the
types defined in <code>include/leveldb/options.h</code>.
<p>
<h2>Block size</h2>
<p>
<code>leveldb</code> groups adjacent keys together into the same block and such a
block is the unit of transfer to and from persistent storage. The
default block size is approximately 4096 uncompressed bytes.
Applications that mostly do bulk scans over the contents of the
database may wish to increase this size. Applications that do a lot
of point reads of small values may wish to switch to a smaller block
size if performance measurements indicate an improvement. There isn't
much benefit in using blocks smaller than one kilobyte, or larger than
a few megabytes. Also note that compression will be more effective
with larger block sizes.
<p>
<h2>Compression</h2>
<p>
Each block is individually compressed before being written to
persistent storage. Compression is on by default since the default
compression method is very fast, and is automatically disabled for
uncompressible data. In rare cases, applications may want to disable
compression entirely, but should only do so if benchmarks show a
performance improvement:
<p>
<pre>
leveldb::Options options;
options.compression = leveldb::kNoCompression;
... leveldb::DB::Open(options, name, ...) ....
</pre>
<h2>Cache</h2>
<p>
The contents of the database are stored in a set of files in the
filesystem and each file stores a sequence of compressed blocks. If
<code>options.cache</code> is non-NULL, it is used to cache frequently used
uncompressed block contents.
<p>
<pre>
#include "leveldb/cache.h"
leveldb::Options options;
options.cache = leveldb::NewLRUCache(100 * 1048576); // 100MB cache
leveldb::DB* db;
leveldb::DB::Open(options, name, &db);
... use the db ...
delete db
delete options.cache;
</pre>
Note that the cache holds uncompressed data, and therefore it should
be sized according to application level data sizes, without any
reduction from compression. (Caching of compressed blocks is left to
the operating system buffer cache, or any custom <code>Env</code>
implementation provided by the client.)
<p>
When performing a bulk read, the application may wish to disable
caching so that the data processed by the bulk read does not end up
displacing most of the cached contents. A per-iterator option can be
used to achieve this:
<p>
<pre>
leveldb::ReadOptions options;
options.fill_cache = false;
leveldb::Iterator* it = db-&gt;NewIterator(options);
for (it-&gt;SeekToFirst(); it-&gt;Valid(); it-&gt;Next()) {
...
}
</pre>
<h2>Key Layout</h2>
<p>
Note that the unit of disk transfer and caching is a block. Adjacent
keys (according to the database sort order) will usually be placed in
the same block. Therefore the application can improve its performance
by placing keys that are accessed together near each other and placing
infrequently used keys in a separate region of the key space.
<p>
For example, suppose we are implementing a simple file system on top
of <code>leveldb</code>. The types of entries we might wish to store are:
<p>
<pre>
filename -&gt; permission-bits, length, list of file_block_ids
file_block_id -&gt; data
</