From e9064cbac555a9cb0d4ab8c80427338c3b764bc1 Mon Sep 17 00:00:00 2001 From: Will Drewry Date: Wed, 9 Jun 2010 17:06:22 -0500 Subject: [PATCH] CHROMIUM: dm-verity: initial cut First cut of dm-verity and dm-bht. This implements a hash-based block device verification dm module. [It still needs work, but I'd love to be able to start doing iterative changes and not working on big, unchecked in blobs :)] BUG=327 TESTS=automated: - dm-bht are coming in the verity cl - autotests need to be written for testing live manual: - by mounting /dev/sda5 with a rootfs on it and using a loop device with the hashes; - ran with slub debug FPZ: discovered dm-io bug which is no longer used in the code - ran dd on the raw device with block sizes from 4k - 1M and counts from 1 to full disk, skip as well - uncovered lock-ups in io_populate loop and oom conditions with initial dm_bht_populate flood of I/O requests (max_bios) - base tests on ref show ~25mb/s. autotests to follow Signed-off-by: Will Drewry Review URL: http://codereview.chromium.org/1921007 --- Documentation/device-mapper/dm-bht.txt | 73 + Documentation/device-mapper/dm-verity.txt | 71 + drivers/md/Kconfig | 38 + drivers/md/Makefile | 2 + drivers/md/dm-bht.c | 1305 +++++++++++++++++ drivers/md/dm-verity.c | 1558 +++++++++++++++++++++ include/linux/dm-bht.h | 153 ++ 7 files changed, 3200 insertions(+) create mode 100644 Documentation/device-mapper/dm-bht.txt create mode 100644 Documentation/device-mapper/dm-verity.txt create mode 100644 drivers/md/dm-bht.c create mode 100644 drivers/md/dm-verity.c create mode 100644 include/linux/dm-bht.h diff --git a/Documentation/device-mapper/dm-bht.txt b/Documentation/device-mapper/dm-bht.txt new file mode 100644 index 000000000000..69bab0272a68 --- /dev/null +++ b/Documentation/device-mapper/dm-bht.txt @@ -0,0 +1,73 @@ +dm-bht +====== + +dm-bht provides a block hash tree implementation. The use of dm-bht allows +for integrity checking of a given block device without reading the entire +set of blocks into memory before use. + +In particular, dm-bht supplies an interface for creating and verifying a tree +of cryptographic digests with any algorithm supported by the kernel crypto API. + +The code is meant to be usable from user-space for creation and verification as +well as directly from a Device-Mapper target. The `verity' target is the +motivating example. + + +Theory of operation +=================== + +dm-bht is logically comprised of multiple nodes organized in a tree-like +structure. Each node in the tree is a cryptographic hash. If it is a leaf +node, the hash is of some block data on disk. If it is an intermediary node, +then the hash is of a number of child nodes. + +dm-bht has a given depth starting at 1 (ignoring the root node). Each level in +the tree is concretely made up of dm_bht_entry structs. Each entry in the tree +is a collection of neighboring nodes that fit in one page-sized block. The +number is determined based on PAGE_SIZE and the size of the selected +cryptographic digest algorithm. The hashes are linearly ordered in this entry +and any unaligned trailing space is ignored but included when calculating the +parent node. + +The tree looks something like: + +depth = 2, alg= sha256, num_blocks = 32767 + [ root ] + / . . . \ + [entry_0] [entry_1] + / . . . \ . . . \ + [entry_0_0] . . . [entry_0_127] . . . . [entry_1_127] + / ... \ / . . . \ / \ + blk_0 ... blk_127 blk_16256 blk_16383 blk_32640 . . . 
blk_32767 + +root is treated independently from the depth and the blocks are expected to +be hashed and supplied to the dm-bht. hash blocks that make up the entry +contents are expected to be read from disk. + +dm-bht does not handle I/O directly but instead expects the consumer to +supply callbacks. The read callback will always receive a page-align value +to pass to the block device layer to read in a hash value. + +Usage +===== + +The API provides mechanisms for reading and verifying a tree as well as +creating and modifying the tree. These two code paths were not meant to be +used in parallel and modify the atomic entry values in incompatible ways. +Where possible, tree creation and modification should be handled independently +from tree verification. + +When reading, all required data for the hash tree should be populated for a +block before attempting a verify. This can be done by calling +dm_bht_populate(). When all data is ready, a call to dm_bht_verify_block() +with the expected hash value will perform both the direct block hash check and +the hashes of the parent and neighboring nodes where needed to ensure validity +up to the root hash. Note, dm_bht_set_root_hexdigest() should be called before +any verification attempts occur. + +When updating the tree, all block hashes should be stored with +dm_bht_store_block(). Once all hashes are stored, a call to dm_bht_compute() +will initiate a full tree update by walking all of the blocks of hashes +starting at the leaf nodes and computing upward to the root node. On +completion, dm_bht_sync() may be called to write the tree to disk (or wherever +the callback writes to). diff --git a/Documentation/device-mapper/dm-verity.txt b/Documentation/device-mapper/dm-verity.txt new file mode 100644 index 000000000000..ee471d9f150b --- /dev/null +++ b/Documentation/device-mapper/dm-verity.txt @@ -0,0 +1,71 @@ +dm-verity +========== + +Device-Mapper's "verity" target provides transparent integrity checking of +block devices using a cryptographic digest provided by the kernel crypto API. +This target is read-only. + +Parameters: + + + This is the device that is going to be integrity checked. It may be + a subset of the full device as specified to dmsetup (start sector and count) + It may be specified as a path, like /dev/sdaX, or a device number, + :. + + + This is the device that that supplies the dm-bht hash data. It may be + specified similarly to the device path and may be the same device. If the + same device is used, the hash offset should be outside of the dm-verity + configured device size. + + + The tree depth determines how many levels of hashes are used when building + the tree of hashes. The root of the tree not included and the leaves of + the tree are the hashes of the blocks on disk. + + + The cryptographic hash algorithm used for this device. This should + be the name of the algorithm, like "sha1". + + + The hexadecimal encoding of the cryptographic hash of all of the + neighboring nodes at the first level of the tree. This hash should be + trusted as there is no other authenticity beyond this point. + + +Theory of operation +=================== + +dm-verity is meant to be setup as part of a verified boot path. This +may be anything ranging from a boot using tboot or trustedgrub to just +booting from a known-good device (like a USB drive or CD). + +When a dm-verity device is configured, it is expected that the caller +has been authenticated in some way (cryptographic signatures, etc). 
+After instantiation, all hashes will be verified on-demand during +disk access. If they cannot be verified up to the root node of the +tree, the root hash, then the I/O will fail. This should identify +tampering with any data on the device and the hash data. + +Cryptographic hashes are used to assert the integrity of the device on a +per-block basis. This allows for a lightweight hash computation on first read +into the page cache. Block hashes are stored linearly aligned to the nearest +block the size of a page. + +For more information on the hashing process, see dm-bht.txt. + + +Example +======= + +Setup a device; +[[ + dmsetup create vroot --table \ + "0 204800 verity /dev/sda1 /dev/sda2 0 3 sha1 "\ + "9f74809a2ee7607b16fcc70d9399a4de9725a727" +]] + +A command line tool is available to compute the hash tree and return the +root hash value. + http://git.chromium.org/cgi-bin/gitweb.cgi?p=dm-verity.git;a=tree diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig index 71000078351a..6ffff5a96197 100644 --- a/drivers/md/Kconfig +++ b/drivers/md/Kconfig @@ -237,6 +237,44 @@ config DM_CRYPT If unsure, say N. +config DM_BHT + tristate "Block hash tree support" + select CRYPTO + select CRYPTO_HASH + ---help--- + Include support for device-mapper devices to use a block hash + tree for managing data integrity checks in a scalable way. + + Targets that use this functionality should include it + automatically. + + If unsure, say N. + +config DM_VERITY + tristate "Verity target support" + depends on BLK_DEV_DM + select DM_BHT + select CRYPTO + select CRYPTO_HASH + # TODO(wad) it is possible that the hash context per cpu approach + # is not safe with PREEMPT. Until checked, PREEMPT is + # not allowed. + depends on !PREEMPT + ---help--- + This device-mapper target allows you to create a device that + transparently integrity checks the data on it. You'll need to + activate the digests you're going to use in the cryptoapi + configuration. + + Information on how to use dm-verity can be found on + + + + To compile this code as a module, choose M here: the module will + be called dm-verity. + + If unsure, say N. + config DM_SNAPSHOT tristate "Snapshot target" depends on BLK_DEV_DM diff --git a/drivers/md/Makefile b/drivers/md/Makefile index 046860c7a166..3e8687c39733 100644 --- a/drivers/md/Makefile +++ b/drivers/md/Makefile @@ -30,6 +30,8 @@ obj-$(CONFIG_BLK_DEV_MD) += md-mod.o obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o obj-$(CONFIG_DM_BUFIO) += dm-bufio.o obj-$(CONFIG_DM_CRYPT) += dm-crypt.o +obj-$(CONFIG_DM_BHT) += dm-bht.o +obj-$(CONFIG_DM_VERITY) += dm-verity.o obj-$(CONFIG_DM_DELAY) += dm-delay.o obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o diff --git a/drivers/md/dm-bht.c b/drivers/md/dm-bht.c new file mode 100644 index 000000000000..4ff79156bb7a --- /dev/null +++ b/drivers/md/dm-bht.c @@ -0,0 +1,1305 @@ + /* + * Copyright (C) 2010 The Chromium OS Authors + * + * Device-Mapper block hash tree interface. + * See Documentation/device-mapper/dm-bht.txt for details. + * + * This file is released under the GPL. + */ + +#include +#include +#include /* for fls() */ +#include +#include /* nr_cpu_ids */ +/* #define CONFIG_DM_DEBUG 1 */ +#include +#include +#include +#include +#include +#include +#include +#include +#include /* k*alloc */ +#include /* memset */ + +#define DM_MSG_PREFIX "dm bht" + +/* For sector formatting. 
*/ +#if defined(_LP64) || defined(__LP64__) || __BITS_PER_LONG == 64 +#define __PRIS_PREFIX "z" +#else +#define __PRIS_PREFIX "ll" +#endif +#define PRIu64 __PRIS_PREFIX "u" + + +/*----------------------------------------------- + * Utilities + *-----------------------------------------------*/ + +static u8 from_hex(u8 ch) +{ + if ((ch >= '0') && (ch <= '9')) + return ch - '0'; + if ((ch >= 'a') && (ch <= 'f')) + return ch - 'a' + 10; + if ((ch >= 'A') && (ch <= 'F')) + return ch - 'A' + 10; + return -1; +} + +/** + * dm_bht_bin_to_hex - converts a binary stream to human-readable hex + * @binary: a byte array of length @binary_len + * @hex: a byte array of length @binary_len * 2 + 1 + */ +static void dm_bht_bin_to_hex(u8 *binary, u8 *hex, unsigned int binary_len) +{ + while (binary_len-- > 0) { + sprintf((char *__restrict__)hex, "%02hhx", (int)*binary); + hex += 2; + binary++; + } +} + +/** + * dm_bht_hex_to_bin - converts a hex stream to binary + * @binary: a byte array of length @binary_len + * @hex: a byte array of length @binary_len * 2 + 1 + */ +static void dm_bht_hex_to_bin(u8 *binary, const u8 *hex, + unsigned int binary_len) +{ + while (binary_len-- > 0) { + *binary = from_hex(*(hex++)); + *binary *= 16; + *binary += from_hex(*(hex++)); + binary++; + } +} + +static void dm_bht_log_mismatch(struct dm_bht *bht, u8 *given, u8 *computed) +{ + u8 given_hex[DM_BHT_MAX_DIGEST_SIZE * 2 + 1]; + u8 computed_hex[DM_BHT_MAX_DIGEST_SIZE * 2 + 1]; + dm_bht_bin_to_hex(given, given_hex, bht->digest_size); + dm_bht_bin_to_hex(computed, computed_hex, bht->digest_size); + DMERR_LIMIT("%s != %s", given_hex, computed_hex); +} + +/* Used for turning verifiers into computers */ +typedef int (*dm_bht_compare_cb)(struct dm_bht *, u8 *, u8 *); + +/** + * dm_bht_compute_and_compare: hashes a page of data; compares it to a hash + */ +static int dm_bht_compute_and_compare(struct dm_bht *bht, + struct hash_desc *hash_desc, + struct page *page, + u8 *expected_digest, + dm_bht_compare_cb compare_cb) +{ + struct scatterlist sg; + u8 digest[DM_BHT_MAX_DIGEST_SIZE]; + int result; + + sg_init_table(&sg, 1); + sg_set_page(&sg, page, PAGE_SIZE, 0); + /* Note, this is synchronous. */ + if (crypto_hash_init(hash_desc)) { + DMCRIT("failed to reinitialize crypto hash (proc:%d)", + smp_processor_id()); + return -EINVAL; + } + if (crypto_hash_digest(hash_desc, &sg, PAGE_SIZE, digest)) { + DMCRIT("crypto_hash_digest failed"); + return -EINVAL; + } + + result = compare_cb(bht, expected_digest, digest); +#ifdef CONFIG_DM_DEBUG + if (result) + dm_bht_log_mismatch(bht, expected_digest, digest); +#endif + return result; +} + +static __always_inline struct dm_bht_level *dm_bht_get_level(struct dm_bht *bht, + unsigned int depth) +{ + return &bht->levels[depth]; +} + +static __always_inline unsigned int dm_bht_get_level_shift(struct dm_bht *bht, + unsigned int depth) +{ + return (bht->depth - depth) * bht->node_count_shift; +} + +/* For the given depth, this is the entry index. At depth+1 it is the node + * index for depth. 
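+ * E.g. with 128 nodes per entry (node_count_shift == 7), leaf 300 maps
+ * to entry 300 >> 7 == 2 at depth bht->depth - 1 and to node
+ * 300 % 128 == 44 within that entry.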
+ */ +static __always_inline unsigned int dm_bht_index_at_level(struct dm_bht *bht, + unsigned int depth, + unsigned int leaf) +{ + return leaf >> dm_bht_get_level_shift(bht, depth); +} + +static __always_inline u8 *dm_bht_node(struct dm_bht *bht, + struct dm_bht_entry *entry, + unsigned int node_index) +{ + return &entry->nodes[node_index * bht->digest_size]; +} + + +/*----------------------------------------------- + * Implementation functions + *-----------------------------------------------*/ + +static int dm_bht_initialize_entries(struct dm_bht *bht); + +static int dm_bht_read_callback_stub(void *ctx, sector_t start, u8 *dst, + sector_t count, + struct dm_bht_entry *entry); +static int dm_bht_write_callback_stub(void *ctx, sector_t start, + u8 *dst, sector_t count, + struct dm_bht_entry *entry); + +/** + * dm_bht_create - prepares @bht for us + * @bht: pointer to a dm_bht_create()d bht + * @depth: tree depth without the root; including block hashes + * @block_count:the number of block hashes / tree leaves + * @alg_name: crypto hash algorithm name + * + * Returns 0 on success. + * + * Callers can offset into devices by storing the data in the io callbacks. + * TODO(wad) bust up into smaller helpers + */ +int dm_bht_create(struct dm_bht *bht, unsigned int depth, + unsigned int block_count, const char *alg_name) +{ + int status = 0; + int cpu = 0; + + /* Allocate enough crypto contexts to be able to perform verifies + * on all available CPUs. + */ + bht->hash_desc = (struct hash_desc *) + kcalloc(nr_cpu_ids, sizeof(struct hash_desc), GFP_KERNEL); + if (!bht->hash_desc) { + DMERR("failed to allocate crypto hash contexts"); + return -ENOMEM; + } + + /* Setup the hash first. Its length determines much of the bht layout */ + for (cpu = 0; cpu < nr_cpu_ids; ++cpu) { + bht->hash_desc[cpu].tfm = crypto_alloc_hash(alg_name, 0, 0); + if (IS_ERR(bht->hash_desc[cpu].tfm)) { + DMERR("failed to allocate crypto hash '%s'", alg_name); + status = -ENOMEM; + bht->hash_desc[cpu].tfm = NULL; + goto bad_hash_alg; + } + } + bht->digest_size = crypto_hash_digestsize(bht->hash_desc[0].tfm); + /* We expect to be able to pack >=2 hashes into a page */ + if (PAGE_SIZE / bht->digest_size < 2) { + DMERR("too few hashes fit in a page"); + status = -EINVAL; + goto bad_digest_len; + } + + if (bht->digest_size > DM_BHT_MAX_DIGEST_SIZE) { + DMERR("DM_BHT_MAX_DIGEST_SIZE too small for chosen digest"); + status = -EINVAL; + goto bad_digest_len; + } + + bht->root_digest = kzalloc(bht->digest_size, GFP_KERNEL); + if (!bht->root_digest) { + DMERR("failed to allocate memory for root digest"); + status = -ENOMEM; + goto bad_root_digest_alloc; + } + /* We use the same defines for root state but just: + * UNALLOCATED, REQUESTED, and VERIFIED since the workflow is + * different. + */ + atomic_set(&bht->root_state, DM_BHT_ENTRY_UNALLOCATED); + + /* Configure the tree */ + bht->block_count = block_count; + DMDEBUG("Setting block_count %u", block_count); + if (block_count == 0) { + DMERR("block_count must be non-zero"); + status = -EINVAL; + goto bad_block_count; + } + + bht->depth = depth; /* assignment below */ + DMDEBUG("Setting depth %u", depth); + if (!depth || depth > UINT_MAX / sizeof(struct dm_bht_level)) { + DMERR("bht depth is invalid: %u", depth); + status = -EINVAL; + goto bad_depth; + } + + /* Allocate levels. Each level of the tree may have an arbitrary number + * of dm_bht_entry structs. Each entry contains node_count nodes. 
+ * Each node in the tree is a cryptographic digest of either node_count + * nodes on the subsequent level or of a specific block on disk. + */ + bht->levels = (struct dm_bht_level *) + kcalloc(depth, sizeof(struct dm_bht_level), GFP_KERNEL); + if (!bht->levels) { + DMERR("failed to allocate tree levels"); + status = -ENOMEM; + goto bad_level_alloc; + } + + /* Each dm_bht_entry->nodes is one page. The node code tracks + * how many nodes fit into one entry where a node is a single + * hash (message digest). + */ + bht->node_count_shift = fls(PAGE_SIZE / bht->digest_size) - 1; + /* Round down to the nearest power of two. This makes indexing + * into the tree much less painful. + */ + bht->node_count = 1 << bht->node_count_shift; + + /* TODO(wad) if node_count < DM_BHT_MAX_NODE_COUNT, then retry with + * node_count_shift-1. + */ + if (bht->node_count > DM_BHT_MAX_NODE_COUNT) { + DMERR("node_count maximum node bitmap size"); + status = -EINVAL; + goto bad_node_count; + } + + /* This is unlikely to happen, but with 64k pages, who knows. */ + if (bht->node_count > UINT_MAX / bht->digest_size) { + DMERR("node_count * hash_len exceeds UINT_MAX!"); + status = -EINVAL; + goto bad_node_count; + } + /* Ensure that we can safely shift by this value. */ + if (depth * bht->node_count_shift >= sizeof(unsigned int) * 8) { + DMERR("specified depth and node_count_shift is too large"); + status = -EINVAL; + goto bad_node_count; + } + + /* Setup callback stubs */ + bht->read_cb = &dm_bht_read_callback_stub; + bht->write_cb = &dm_bht_write_callback_stub; + + status = dm_bht_initialize_entries(bht); + if (status) + goto bad_entries_alloc; + + bht->verify_mode = DM_BHT_REVERIFY_LEAVES; + bht->entry_readahead = 0; + return 0; + +bad_entries_alloc: + while (bht->depth-- > 0) + kfree(bht->levels[bht->depth].entries); +bad_node_count: + kfree(bht->levels); +bad_level_alloc: +bad_block_count: +bad_depth: + kfree(bht->root_digest); +bad_root_digest_alloc: +bad_digest_len: + for (cpu = 0; cpu < nr_cpu_ids; ++cpu) + if (bht->hash_desc[cpu].tfm) + crypto_free_hash(bht->hash_desc[cpu].tfm); +bad_hash_alg: + kfree(bht->hash_desc); + return status; +} +EXPORT_SYMBOL(dm_bht_create); + +static int dm_bht_initialize_entries(struct dm_bht *bht) +{ + /* The last_index represents the index into the last + * block digest that will be stored in the tree. By walking the + * tree with that index, it is possible to compute the total number + * of entries needed at each level in the tree. + * + * Since each entry will contain up to |node_count| nodes of the tree, + * it is possible that the last index may not be at the end of a given + * entry->nodes. In that case, it is assumed the value is padded. + * + * Note, we treat both the tree root (1 hash) and the tree leaves + * independently from the bht data structures. Logically, the root is + * depth=-1 and the block layer level is depth=bht->depth + */ + unsigned int last_index = ALIGN(bht->block_count, bht->node_count) - 1; + unsigned int total_entries = 0; + struct dm_bht_level *level = NULL; + unsigned int depth; + + /* check that the largest level->count can't result in an int overflow + * on allocation or sector calculation. + */ + if (((last_index >> bht->node_count_shift) + 1) > + UINT_MAX / max((unsigned int)sizeof(struct dm_bht_entry), + (unsigned int)to_sector(PAGE_SIZE))) { + DMCRIT("required entries %u is too large", + last_index + 1); + return -EINVAL; + } + + /* Track the current sector location for each level so we don't have to + * compute it during traversals. 
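+ * Each entry occupies one page of hash data on disk, so, e.g., the
+ * depth-2 sha256 tree in dm-bht.txt places level 0 at sectors [0, 16)
+ * and level 1 at sectors [16, 2064), assuming 4 KiB pages.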
+ */ + bht->sectors = 0; + for (depth = 0; depth < bht->depth; ++depth) { + level = dm_bht_get_level(bht, depth); + level->count = dm_bht_index_at_level(bht, depth, + last_index) + 1; + DMDEBUG("depth: %u entries: %u", depth, level->count); + /* TODO(wad) consider the case where the data stored for each + * level is done with contiguous pages (instead of using + * entry->nodes) and the level just contains two bitmaps: + * (a) which pages have been loaded from disk + * (b) which specific nodes have been verified. + */ + level->entries = kcalloc(level->count, + sizeof(struct dm_bht_entry), + GFP_KERNEL); + if (!level->entries) { + DMERR("failed to allocate entries for depth %u", + bht->depth); + /* let the caller clean up the mess */ + return -ENOMEM; + } + total_entries += level->count; + level->sector = bht->sectors; + /* number of sectors per entry * entries at this level */ + bht->sectors += level->count * to_sector(PAGE_SIZE); + /* not ideal, but since unsigned overflow behavior is defined */ + if (bht->sectors < level->sector) { + DMCRIT("level sector calculation overflowed"); + return -EINVAL; + } + } + + /* Go ahead and reserve enough space for everything. We really don't + * want memory allocation failures. Once we start freeing verified + * entries, then we can reduce this reservation. + */ + bht->entry_pool = mempool_create_page_pool(total_entries, 0); + if (!bht->entry_pool) { + DMERR("failed to allocate mempool"); + return -ENOMEM; + } + + return 0; +} + +static int dm_bht_read_callback_stub(void *ctx, sector_t start, u8 *dst, + sector_t count, struct dm_bht_entry *entry) +{ + DMCRIT("dm_bht_read_callback_stub called!"); + dm_bht_read_completed(entry, -EIO); + return -EIO; +} + +static int dm_bht_write_callback_stub(void *ctx, sector_t start, + u8 *dst, sector_t count, + struct dm_bht_entry *entry) +{ + DMCRIT("dm_bht_write_callback_stub called!"); + dm_bht_write_completed(entry, -EIO); + return -EIO; +} + +/** + * dm_bht_read_completed + * @entry: pointer to the entry that's been loaded + * @status: I/O status. Non-zero is failure. + * MUST always be called after a read_cb completes. + */ +void dm_bht_read_completed(struct dm_bht_entry *entry, int status) +{ + int state; + if (status) { + /* TODO(wad) add retry support */ + DMCRIT("an I/O error occurred while reading entry"); + atomic_set(&entry->state, DM_BHT_ENTRY_ERROR_IO); + /* entry->nodes will be freed later */ + return; + } + state = atomic_cmpxchg(&entry->state, + DM_BHT_ENTRY_PENDING, + DM_BHT_ENTRY_READY); + if (state != DM_BHT_ENTRY_PENDING) { + DMCRIT("state changed on entry out from under io"); + BUG(); + } +} +EXPORT_SYMBOL(dm_bht_read_completed); + +/** + * dm_bht_write_completed + * @entry: pointer to the entry that's been loaded + * @status: I/O status. Non-zero is failure. + * Should be called after a write_cb completes. Currently only catches + * errors which more writers don't care about. + */ +void dm_bht_write_completed(struct dm_bht_entry *entry, int status) +{ + if (status) { + DMCRIT("an I/O error occurred while writing entry"); + atomic_set(&entry->state, DM_BHT_ENTRY_ERROR_IO); + /* entry->nodes will be freed later */ + return; + } +} +EXPORT_SYMBOL(dm_bht_write_completed); + + +/* dm_bht_maybe_read_entries + * Attempts to atomically acquire each entry, allocated any needed + * memory, and issues I/O callbacks to load the hashes from disk. + * Returns 0 if all entries are loaded and verified. On error, the + * return value is negative. When positive, it is the state values + * ORd. 
+ */ +static int dm_bht_maybe_read_entries(struct dm_bht *bht, void *ctx, + unsigned int depth, unsigned int index, + unsigned int count, bool until_exist) +{ + struct dm_bht_level *level; + struct dm_bht_entry *entry, *last_entry; + sector_t current_sector; + int state = 0; + int status = 0; + struct page *node_page = NULL; + BUG_ON(depth >= bht->depth); + + level = &bht->levels[depth]; + if (count > level->count - index) { + DMERR("dm_bht_maybe_read_entries(d=%u,ei=%u,count=%u): " + "index+count exceeds available entries %u", + depth, index, count, level->count); + return -EINVAL; + } + /* XXX: hardcoding PAGE_SIZE means that a perfectly valid image + * on one system may not work on a different kernel. + * TODO(wad) abstract PAGE_SIZE with a bht->entry_size or + * at least a define and ensure bht->entry_size is + * sector aligned at least. + */ + current_sector = level->sector + to_sector(index * PAGE_SIZE); + for (entry = &level->entries[index], last_entry = entry + count; + entry < last_entry; + ++entry, current_sector += to_sector(PAGE_SIZE)) { + /* If the entry's state is UNALLOCATED, then we'll claim it + * for allocation and loading. + */ + state = atomic_cmpxchg(&entry->state, + DM_BHT_ENTRY_UNALLOCATED, + DM_BHT_ENTRY_PENDING); + DMDEBUG("dm_bht_maybe_read_entries(d=%u,ei=%u,count=%u): " + "ei=%lu, state=%d", + depth, index, count, + (unsigned long)(entry - level->entries), state); + if (state <= DM_BHT_ENTRY_ERROR) { + DMCRIT("entry %u is in an error state", index); + return state; + } + + /* Currently, the verified state is unused. */ + if (state == DM_BHT_ENTRY_VERIFIED) { + if (until_exist) + return 0; + /* Makes 0 == verified. Is that ideal? */ + continue; + } + + if (state != DM_BHT_ENTRY_UNALLOCATED) { + /* PENDING, READY, ... */ + if (until_exist) + return state; + status |= state; + continue; + } + /* Current entry is claimed for allocation and loading */ + node_page = (struct page *) mempool_alloc(bht->entry_pool, + GFP_NOIO); + if (!node_page) { + DMCRIT("failed to allocate memory for " + "entry->nodes from pool"); + return -ENOMEM; + } + /* dm-bht guarantees page-aligned memory for callbacks. 
*/ + entry->nodes = page_address(node_page); + /* Let the caller know that not all the data is yet available */ + status |= DM_BHT_ENTRY_REQUESTED; + /* Issue the read callback */ + /* TODO(wad) error check callback here too */ + DMDEBUG("dm_bht_maybe_read_entries(d=%u,ei=%u,count=%u): " + "reading %lu", + depth, index, count, + (unsigned long)(entry - level->entries)); + bht->read_cb(ctx, /* external context */ + current_sector, /* starting sector */ + entry->nodes, /* destination */ + to_sector(PAGE_SIZE), + entry); /* io context */ + + } + /* Should only be 0 if all entries were verified and not just ready */ + return status; +} + +static int dm_bht_compare_hash(struct dm_bht *bht, u8 *known, u8 *computed) +{ + return memcmp(known, computed, bht->digest_size); +} + +static int dm_bht_update_hash(struct dm_bht *bht, u8 *known, u8 *computed) +{ +#ifdef CONFIG_DM_DEBUG + u8 hex[DM_BHT_MAX_DIGEST_SIZE * 2 + 1]; +#endif + memcpy(known, computed, bht->digest_size); +#ifdef CONFIG_DM_DEBUG + dm_bht_bin_to_hex(computed, hex, bht->digest_size); + DMDEBUG("updating with hash: %s", hex); +#endif + return 0; +} + +/* digest length and bht must be checked already */ +static int dm_bht_check_block(struct dm_bht *bht, unsigned int block_index, + u8 *digest, int *entry_state) +{ + int depth; + unsigned int index; + struct dm_bht_entry *entry; + + /* The leaves contain the block hashes */ + depth = bht->depth - 1; + + /* Index into the bottom level. Each entry in this level contains + * nodes whose hashes are the direct hashes of one block of data on + * disk. + */ + index = block_index >> bht->node_count_shift; + entry = &bht->levels[depth].entries[index]; + + *entry_state = atomic_read(&entry->state); + if (*entry_state <= DM_BHT_ENTRY_ERROR) { + DMCRIT("leaf entry for block %u is invalid", + block_index); + return *entry_state; + } + if (*entry_state <= DM_BHT_ENTRY_PENDING) { + DMERR("leaf data not yet loaded for block %u", + block_index); + return 1; + } + + /* Index into the entry data */ + index = (block_index % bht->node_count) * bht->digest_size; + if (memcmp(&entry->nodes[index], digest, bht->digest_size)) { + DMCRIT("digest mismatch for block %u", block_index); + dm_bht_log_mismatch(bht, &entry->nodes[index], digest); + return DM_BHT_ENTRY_ERROR_MISMATCH; + } + /* TODO(wad) update bht->block_bitmap here or in the caller */ + return 0; +} + +/* Walk all entries at level 0 to compute the root digest. + * 0 on success. 
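+ * The root digest is the hash of the level 0 entry pages concatenated
+ * in order, e.g. H(entry_0 || entry_1) for the example tree in
+ * Documentation/device-mapper/dm-bht.txt.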
+ */ +static int dm_bht_compute_root(struct dm_bht *bht, u8 *digest) +{ + struct dm_bht_entry *entry; + unsigned int count; + struct scatterlist sg; /* feeds digest() */ + struct hash_desc *hash_desc; + hash_desc = &bht->hash_desc[smp_processor_id()]; + entry = bht->levels[0].entries; + + if (crypto_hash_init(hash_desc)) { + DMCRIT("failed to reinitialize crypto hash (proc:%d)", + smp_processor_id()); + return -EINVAL; + } + + /* Point the scatterlist to the entries, then compute the digest */ + for (count = 0; count < bht->levels[0].count; ++count, ++entry) { + if (atomic_read(&entry->state) <= DM_BHT_ENTRY_PENDING) { + DMCRIT("data not ready to compute root: %u", + count); + return 1; + } + sg_init_table(&sg, 1); + sg_set_page(&sg, virt_to_page(entry->nodes), PAGE_SIZE, 0); + if (crypto_hash_update(hash_desc, &sg, PAGE_SIZE)) { + DMCRIT("Failed to update crypto hash"); + return -EINVAL; + } + } + + if (crypto_hash_final(hash_desc, digest)) { + DMCRIT("Failed to compute final digest"); + return -EINVAL; + } + + return 0; +} + +static int dm_bht_verify_root(struct dm_bht *bht, + dm_bht_compare_cb compare_cb) +{ + int status = 0; + u8 digest[DM_BHT_MAX_DIGEST_SIZE]; + if (atomic_read(&bht->root_state) == DM_BHT_ENTRY_VERIFIED) + return 0; + status = dm_bht_compute_root(bht, digest); + if (status) { + DMCRIT("Failed to compute root digest for verification"); + return status; + } + DMDEBUG("root computed"); + status = compare_cb(bht, bht->root_digest, digest); + if (status) { + DMCRIT("invalid root digest: %d", status); + dm_bht_log_mismatch(bht, bht->root_digest, digest); + return DM_BHT_ENTRY_ERROR_MISMATCH; + } + /* Could do a cmpxchg, but this should be safe. */ + atomic_set(&bht->root_state, DM_BHT_ENTRY_VERIFIED); + return 0; +} + +/* dm_bht_verify_path + * Verifies the path to block_index from depth=0. Returns 0 on ok. + */ +static int dm_bht_verify_path(struct dm_bht *bht, unsigned int block_index, + dm_bht_compare_cb compare_cb) +{ + unsigned int depth; + struct dm_bht_level *level; + struct dm_bht_entry *entry; + struct dm_bht_entry *parent = NULL; + u8 *node; + unsigned int node_index; + unsigned int entry_index; + unsigned int parent_index = 0; /* for logging */ + struct hash_desc *hash_desc; + int state; + int ret = 0; + + hash_desc = &bht->hash_desc[smp_processor_id()]; + for (depth = 0; depth < bht->depth; ++depth) { + level = dm_bht_get_level(bht, depth); + entry_index = dm_bht_index_at_level(bht, depth, block_index); + DMDEBUG("verify_path for bi=%u on d=%d ei=%u (max=%u)", + block_index, depth, entry_index, level->count); + BUG_ON(entry_index >= level->count); + entry = &level->entries[entry_index]; + + /* Catch any existing errors */ + state = atomic_read(&entry->state); + if (state <= DM_BHT_ENTRY_ERROR) { + DMCRIT("entry(d=%u,idx=%u) is in an error state: %d", + depth, entry_index, state); + DMCRIT("verification is not possible"); + ret = 1; + break; + } else if (state <= DM_BHT_ENTRY_PENDING) { + DMERR("entry not ready for verify: d=%u,e=%u", + depth, entry_index); + ret = 1; + break; + } + + /* We go through the error-checking for depth=0 here so we + * don't have to duplicate it above. After this point, parent + * will always be set. + */ + if (!parent) { + parent_index = entry_index; + parent = entry; + continue; + } + + + /* Because we are one level down, the node_index into the + * the parent's node list modulo the number of nodes. 
+ */ + node_index = entry_index % bht->node_count; + + /* If the nodes in entry have already been checked against the + * parent, then we're done. + */ + if (state == DM_BHT_ENTRY_VERIFIED) { + DMDEBUG("verify_path node %u is verified in parent %u", + node_index, parent_index); + /* If this entry has been checked, move along. */ + parent_index = entry_index; + parent = entry; + continue; + } + + DMDEBUG("verify_path node %u is not verified in parent %u", + node_index, parent_index); + /* We need to check that this entry matches the expected + * hash in the parent->nodes. + */ + node = dm_bht_node(bht, parent, node_index); + if (dm_bht_compute_and_compare(bht, hash_desc, + virt_to_page(entry->nodes), + node, compare_cb)) { + DMERR("failed to verify entry's hash against parent " + "(d=%u,ei=%u,p=%u,pnode=%u)", + depth, entry_index, parent_index, node_index); + ret = DM_BHT_ENTRY_ERROR_MISMATCH; + break; + } + + /* Instead of keeping a bitmap of which children have been + * checked, this data is kept in the child state. If full + * reverifies have been set, then no intermediate entry/node is + * ever marked as verified. + */ + if (bht->verify_mode != DM_BHT_FULL_REVERIFY) + atomic_cmpxchg(&entry->state, + DM_BHT_ENTRY_READY, + DM_BHT_ENTRY_VERIFIED); + parent_index = entry_index; + parent = entry; + } + return ret; +} + +/** + * dm_bht_store_block - sets a given block's hash in the tree + * @bht: pointer to a dm_bht_create()d bht + * @block_index:numeric index of the block in the tree + * @digest: array of u8s containing the digest of length @bht->digest_size + * + * Returns 0 on success, >0 when data is pending, and <0 when a IO or other + * error has occurred. + * + * If the containing entry in the tree is unallocated, it will allocate memory + * and mark the entry as ready. All other block entries will be 0s. This + * function is not safe for simultaneous use when verifying data and should not + * be used if the @bht is being accessed by any other functions in any other + * threads/processes. + * + * It is expected that virt_to_page will work on |block_data|. + */ +int dm_bht_store_block(struct dm_bht *bht, unsigned int block_index, + u8 *block_data) +{ + int depth; + unsigned int index; + unsigned int node_index; + struct dm_bht_entry *entry; + struct dm_bht_level *level; + int state; + struct hash_desc *hash_desc; + struct page *node_page = NULL; + + /* Look at the last level of nodes above the leaves (data blocks) */ + depth = bht->depth - 1; + + /* Index into the level */ + level = dm_bht_get_level(bht, depth); + index = dm_bht_index_at_level(bht, depth, block_index); + /* Grab the node index into the current entry by getting the + * index at the leaf-level. + */ + node_index = dm_bht_index_at_level(bht, depth + 1, block_index) % + bht->node_count; + entry = &level->entries[index]; + + DMDEBUG("Storing block %u in d=%d,ei=%u,ni=%u,s=%d", + block_index, depth, index, node_index, + atomic_read(&entry->state)); + + state = atomic_cmpxchg(&entry->state, + DM_BHT_ENTRY_UNALLOCATED, + DM_BHT_ENTRY_PENDING); + /* !!! Note. It is up to the users of the update interface to + * ensure the entry data is fully populated prior to use. + * The number of updated entries is NOT tracked. 
+ */ + if (state == DM_BHT_ENTRY_UNALLOCATED) { + node_page = (struct page *) mempool_alloc(bht->entry_pool, + GFP_KERNEL); + if (!node_page) { + atomic_set(&entry->state, DM_BHT_ENTRY_ERROR); + return -ENOMEM; + } + entry->nodes = page_address(node_page); + /* TODO(wad) could expose this to the caller to that they + * can transition from unallocated to ready manually. + */ + atomic_set(&entry->state, DM_BHT_ENTRY_READY); + } else if (state <= DM_BHT_ENTRY_ERROR) { + DMCRIT("leaf entry for block %u is invalid", + block_index); + return state; + } else if (state == DM_BHT_ENTRY_PENDING) { + DMERR("leaf data is pending for block %u", block_index); + return 1; + } + + hash_desc = &bht->hash_desc[smp_processor_id()]; + dm_bht_compute_and_compare(bht, hash_desc, virt_to_page(block_data), + dm_bht_node(bht, entry, node_index), + &dm_bht_update_hash); + return 0; +} +EXPORT_SYMBOL(dm_bht_store_block); + +/** + * dm_bht_zeroread_callback - read callback which always returns 0s + * @ctx: ignored + * @start: ignored + * @data: buffer to write 0s to + * @count: number of sectors worth of data to write + * @complete_ctx: opaque context for @completed + * @completed: callback to confirm end of data read + * + * Always returns 0. + * + * Meant for use by dm_compute() callers. It allows dm_populate to + * be used to pre-fill a tree with zeroed out entry nodes. + */ +int dm_bht_zeroread_callback(void *ctx, sector_t start, u8 *dst, + sector_t count, struct dm_bht_entry *entry) +{ + memset(dst, 0, to_bytes(count)); + dm_bht_read_completed(entry, 0); + return 0; +} +EXPORT_SYMBOL(dm_bht_zeroread_callback); + +/** + * dm_bht_compute - computes and updates all non-block-level hashes in a tree + * @bht: pointer to a dm_bht_create()d bht + * @read_cb_ctx:opaque read_cb context for all I/O on this call + * + * Returns 0 on success, >0 when data is pending, and <0 when a IO or other + * error has occurred. + * + * Walks the tree and computes the hashes at each level from the + * hashes below. This can only be called once per tree creation + * since it will mark entries verified. Expects dm_bht_populate() to + * correctly populate the tree from the read_callback_stub. + * + * This function should not be used when verifying the same tree and + * should not be used with multiple simultaneous operators on @bht. + */ +int dm_bht_compute(struct dm_bht *bht, void *read_cb_ctx) +{ + unsigned int block; + int updated = 0; + /* Start at the last block and walk backwards. */ + for (block = bht->block_count - 1; block != 0; + block -= bht->node_count) { + DMDEBUG("Updating levels for block %u", block); + updated = dm_bht_populate(bht, read_cb_ctx, block); + if (updated < 0) { + DMERR("Failed to pre-zero entries"); + return updated; + } + updated = dm_bht_verify_path(bht, + block, + dm_bht_update_hash); + if (updated) { + DMERR("Failed to update levels for block %u", + block); + return updated; + } + if (block < bht->node_count) + break; + } + /* Don't forget the root digest! */ + DMDEBUG("Calling verify_root with update_hash"); + updated = dm_bht_verify_root(bht, dm_bht_update_hash); + return updated; +} +EXPORT_SYMBOL(dm_bht_compute); + +/** + * dm_bht_sync - writes the tree in memory to disk + * @bht: pointer to a dm_bht_create()d bht + * @write_ctx: callback context for writes issued + * + * Since all entry nodes are PAGE_SIZE, the data will be pre-aligned and + * padded. 
+ */ +int dm_bht_sync(struct dm_bht *bht, void *write_cb_ctx) +{ + unsigned int depth; + int ret = 0; + int state; + sector_t sector; + struct dm_bht_level *level; + struct dm_bht_entry *entry; + struct dm_bht_entry *entry_end; + + for (depth = 0; depth < bht->depth; ++depth) { + level = dm_bht_get_level(bht, depth); + entry_end = level->entries + level->count; + sector = level->sector; + for (entry = level->entries; entry < entry_end; ++entry) { + state = atomic_read(&entry->state); + if (state <= DM_BHT_ENTRY_PENDING) { + DMERR("At depth %d, entry %lu is not ready", + depth, + (unsigned long)(entry - level->entries)); + return state; + } + ret = bht->write_cb(write_cb_ctx, + sector, + entry->nodes, + to_sector(PAGE_SIZE), + entry); + if (ret) { + DMCRIT("an error occurred writing entry %lu", + (unsigned long)(entry - level->entries)); + return ret; + } + sector += to_sector(PAGE_SIZE); + } + } + + return 0; +} +EXPORT_SYMBOL(dm_bht_sync); + +/** + * dm_bht_populate - reads entries from disk needed to verify a given block + * @bht: pointer to a dm_bht_create()d bht + * @read_cb_ctx:context used for all read_cb calls on this request + * @block_index:specific block data is expected from + * + * Callers may wish to call dm_bht_populate(0) immediately after initialization + * to start loading in all the level[0] entries. + */ +int dm_bht_populate(struct dm_bht *bht, void *read_cb_ctx, + unsigned int block_index) +{ + unsigned int depth; + struct dm_bht_level *level; + int populated = 0; /* return value */ + unsigned int entry_index = 0; + int read_status = 0; + int root_state = 0; + + if (block_index >= bht->block_count) { + DMERR("Request to populate for invalid block: %u", + block_index); + return -EIO; + } + DMDEBUG("dm_bht_populate(%u)", block_index); + + /* Load in all of level 0 if the root is unverified */ + root_state = atomic_read(&bht->root_state); + /* TODO(wad) create a separate io object for the root request which + * can continue on and be verified and stop making every request + * check. + */ + if (root_state != DM_BHT_ENTRY_VERIFIED) { + DMDEBUG("root data is not yet loaded"); + /* If positive, it means some are pending. */ + populated = dm_bht_maybe_read_entries(bht, read_cb_ctx, 0, 0, + bht->levels[0].count, + true); + if (populated < 0) { + DMCRIT("an error occurred while reading level[0]"); + /* TODO(wad) define std error codes */ + return populated; + } + } + + for (depth = 1; depth < bht->depth; ++depth) { + level = dm_bht_get_level(bht, depth); + entry_index = dm_bht_index_at_level(bht, depth, + block_index); + DMDEBUG("populate for bi=%u on d=%d ei=%u (max=%u)", + block_index, depth, entry_index, level->count); + + /* Except for the root node case, we should only ever need + * to load one entry along the path. + */ + read_status = dm_bht_maybe_read_entries(bht, read_cb_ctx, + depth, entry_index, + 1, false); + if (unlikely(read_status < 0)) { + DMCRIT("failure occurred reading entry %u depth %u", + entry_index, depth); + return read_status; + } + /* Accrue return code flags */ + populated |= read_status; + + /* Attempt to pull in up to entry_readahead extra entries on + * this I/O call iff we're doing the read right now. This + * helps optimize sequential access to the mapped drive. 
+ */ + if (bht->entry_readahead && + (read_status & DM_BHT_ENTRY_REQUESTED)) { + unsigned int readahead_count; + entry_index++; + readahead_count = min(bht->entry_readahead, + level->count - entry_index); + /* The result is completely ignored since this call is + * critical for the current request. + */ + if (readahead_count) + dm_bht_maybe_read_entries(bht, read_cb_ctx, + depth, entry_index, + readahead_count, + true); + } + } + + /* All nodes are ready. The hash for the block_index can be verified */ + return populated; +} +EXPORT_SYMBOL(dm_bht_populate); + + +/** + * dm_bht_verify_block - checks that all nodes in the path for @block are valid + * @bht: pointer to a dm_bht_create()d bht + * @block_index:specific block data is expected from + * @digest: computed digest for the given block to be checked + * @digest_len: length of @digest + * + * Returns 0 on success, 1 on missing data, and a negative error + * code on verification failure. All supporting functions called + * should return similarly. + */ +int dm_bht_verify_block(struct dm_bht *bht, unsigned int block_index, + u8 *digest, unsigned int digest_len) +{ + int unverified = 0; + int entry_state = 0; + + /* TODO(wad) do we really need this? */ + if (digest_len != bht->digest_size) { + DMERR("invalid digest_len passed to verify_block"); + return -EINVAL; + } + + /* Make sure that the root has been verified */ + if (atomic_read(&bht->root_state) != DM_BHT_ENTRY_VERIFIED) { + unverified = dm_bht_verify_root(bht, dm_bht_compare_hash); + if (unverified) { + DMCRIT("Failed to verify root: %d", unverified); + return unverified; + } + } + + /* Now check that the digest supplied matches the leaf hash */ + unverified = dm_bht_check_block(bht, block_index, digest, &entry_state); + if (unverified) { + DMCRIT("Block check failed for %u: %d", block_index, + unverified); + return unverified; + } + + /* If the entry which contains the block hash is marked verified, then + * it means that its hash has been check with the parent. In addition, + * since that is only possible via verify_path, it transitively means + * it is verified to the root of the tree. If the depth is 1, then it + * means the entry was verified during root verification. + */ + if (entry_state == DM_BHT_ENTRY_VERIFIED || bht->depth == 1) + return unverified; + + /* Now check levels in between */ + unverified = dm_bht_verify_path(bht, + block_index, + dm_bht_compare_hash); + if (unverified) + DMERR("Failed to verify intermediary nodes for block: %u (%d)", + block_index, unverified); + return unverified; +} +EXPORT_SYMBOL(dm_bht_verify_block); + +/** + * dm_bht_destroy - cleans up all memory used by @bht + * @bht: pointer to a dm_bht_create()d bht + * + * Returns 0 on success. Does not free @bht itself. + */ +int dm_bht_destroy(struct dm_bht *bht) +{ + unsigned int depth; + int cpu = 0; + + kfree(bht->root_digest); + + depth = bht->depth; + while (depth-- != 0) { + struct dm_bht_entry *entry = bht->levels[depth].entries; + struct dm_bht_entry *entry_end = entry + + bht->levels[depth].count; + int state = 0; + for (; entry < entry_end; ++entry) { + state = atomic_read(&entry->state); + switch (state) { + /* At present, no other states free memory, + * but that will change. 
+ */ + case DM_BHT_ENTRY_UNALLOCATED: + /* Allocated with improper state */ + BUG_ON(entry->nodes); + continue; + default: + BUG_ON(!entry->nodes); + mempool_free(entry->nodes, bht->entry_pool); + break; + } + } + kfree(bht->levels[depth].entries); + bht->levels[depth].entries = NULL; + } + mempool_destroy(bht->entry_pool); + kfree(bht->levels); + for (cpu = 0; cpu < nr_cpu_ids; ++cpu) + if (bht->hash_desc[cpu].tfm) + crypto_free_hash(bht->hash_desc[cpu].tfm); + kfree(bht->hash_desc); + return 0; +} +EXPORT_SYMBOL(dm_bht_destroy); + +/*----------------------------------------------- + * Accessors + *-----------------------------------------------*/ + +/** + * dm_bht_sectors - return the sectors required on disk + * @bht: pointer to a dm_bht_create()d bht + */ +sector_t dm_bht_sectors(const struct dm_bht *bht) +{ + return bht->sectors; +} +EXPORT_SYMBOL(dm_bht_sectors); + +/** + * dm_bht_set_read_cb - set read callback + * @bht: pointer to a dm_bht_create()d bht + * @read_cb: callback function used for all read requests by @bht + */ +void dm_bht_set_read_cb(struct dm_bht *bht, dm_bht_callback read_cb) +{ + bht->read_cb = read_cb; +} +EXPORT_SYMBOL(dm_bht_set_read_cb); + +/** + * dm_bht_set_write_cb - set write callback + * @bht: pointer to a dm_bht_create()d bht + * @write_cb: callback function used for all write requests by @bht + */ +void dm_bht_set_write_cb(struct dm_bht *bht, dm_bht_callback write_cb) +{ + bht->write_cb = write_cb; +} +EXPORT_SYMBOL(dm_bht_set_write_cb); + +/** + * dm_bht_set_verify_mode - set verify mode + * @bht: pointer to a dm_bht_create()d bht + * @verify_mode: indicate verification behavior + */ +void dm_bht_set_verify_mode(struct dm_bht *bht, int verify_mode) +{ + bht->verify_mode = verify_mode; +} +EXPORT_SYMBOL(dm_bht_set_verify_mode); + +/** + * dm_bht_set_entry_readahead - set verify mode + * @bht: pointer to a dm_bht_create()d bht + * @readahead_count: number of entries to readahead from a given level + */ +void dm_bht_set_entry_readahead(struct dm_bht *bht, + unsigned int readahead_count) +{ + bht->entry_readahead = readahead_count; +} +EXPORT_SYMBOL(dm_bht_set_entry_readahead); + +/** + * dm_bht_set_root_hexdigest - sets an unverified root digest hash from hex + * @bht: pointer to a dm_bht_create()d bht + * @hexdigest: array of u8s containing the new digest in binary + * Returns non-zero on error. hexdigest should be NUL terminated. + */ +int dm_bht_set_root_hexdigest(struct dm_bht *bht, const u8 *hexdigest) +{ + if (!bht->root_digest) { + DMCRIT("No allocation for root digest. Call dm_bht_create"); + return -1; + } + /* Make sure we have at least the bytes expected */ + if (strnlen((char *)hexdigest, bht->digest_size * 2) != + bht->digest_size * 2) { + DMERR("root digest length does not match hash algorithm"); + return -1; + } + dm_bht_hex_to_bin(bht->root_digest, hexdigest, bht->digest_size); +#ifdef CONFIG_DM_DEBUG + DMINFO("Set root digest to %s. Parsed as -> ", hexdigest); + dm_bht_log_mismatch(bht, bht->root_digest, bht->root_digest); +#endif + /* Mark the root as unallocated to ensure that it transitions to + * requested just in case the levels aren't loaded at this point. 
+ */ + atomic_set(&bht->root_state, DM_BHT_ENTRY_UNALLOCATED); + return 0; +} +EXPORT_SYMBOL(dm_bht_set_root_hexdigest); + +/** + * dm_bht_root_hexdigest - returns root digest in hex + * @bht: pointer to a dm_bht_create()d bht + * @hexdigest: u8 array of size @available + * @available: must be bht->digest_size * 2 + 1 + */ +int dm_bht_root_hexdigest(struct dm_bht *bht, u8 *hexdigest, int available) +{ + if (available < 0 || + ((unsigned int) available) < bht->digest_size * 2 + 1) { + DMERR("hexdigest has too few bytes available"); + return -EINVAL; + } + if (!bht->root_digest) { + DMERR("no root digest exists to export"); + if (available > 0) + *hexdigest = 0; + return -1; + } + dm_bht_bin_to_hex(bht->root_digest, hexdigest, bht->digest_size); + return 0; +} +EXPORT_SYMBOL(dm_bht_root_hexdigest); + diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c new file mode 100644 index 000000000000..b9797f9097fe --- /dev/null +++ b/drivers/md/dm-verity.c @@ -0,0 +1,1558 @@ +/* + * Originally based on dm-crypt.c, + * Copyright (C) 2003 Christophe Saout + * Copyright (C) 2004 Clemens Fruhwirth + * Copyright (C) 2006-2008 Red Hat, Inc. All rights reserved. + * Copyright (C) 2010 The Chromium OS Authors + * All Rights Reserved. + * + * This file is released under the GPL. + * + * Implements a verifying transparent block device. + * See Documentation/device-mapper/dm-verity.txt + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* #define CONFIG_DM_DEBUG 1 */ +#define CONFIG_DM_VERITY_TRACE 1 +#include +#include + +#define DM_MSG_PREFIX "verity" +#define MESG_STR(x) x, sizeof(x) + +/* Supports up to 512-bit digests */ +#define VERITY_MAX_DIGEST_SIZE 64 + +/* TODO(wad) make both of these report the error line/file to a + * verity_bug function. + */ +#define VERITY_BUG(msg...) BUG() +#define VERITY_BUG_ON(cond, msg...) BUG_ON(cond) + +/* Helper for printing sector_t */ +#define ULL(x) ((unsigned long long)(x)) + +/* Requires the pool to have the given # of pages available. They are only + * used for padding non-block aligned requests so each request should need + * at most 2 additional pages. This means our maximum queue without suffering + * from memory contention could be 32 requests. + */ +#define MIN_POOL_PAGES 16 +/* IOS represent min of dm_verity_ios in a pool, but we also use it to + * preallocate biosets (MIN_IOS * 2): + * 1. We need to clone the entire bioset, including bio_vecs, before passing + * them to the underlying block layer since it may alter the values. + * 2. We need to pad out biosets that are not block aligned. + * 3. We need to be able to create biosets while loading in hashes. + * This will need more tweaking for specific workload expectations. + */ +#define MIN_IOS 32 +/* During io_bht_read, we will spawn _many_ bios for a single I/O early on, but + * once the tree is populated, we will only need MIN_IOS at most to be able to + * pad out the request. We will also need space for the padding biovecs which + * is at most 2, less than one page per side. + */ +#define MIN_BIOS (MIN_IOS * 2) + +/* We use a block size == page_size in sectors */ +#define VERITY_BLOCK_SIZE ((PAGE_SIZE) >> (SECTOR_SHIFT)) + +/* Support additional tracing of requests */ +#ifdef CONFIG_DM_VERITY_TRACE +#define VERITY_TRACE(param, fmt, args...) 
{ \ + if (param) \ + DMINFO(fmt, ## args); \ +} +static int request_trace; +module_param(request_trace, bool, 0644); +MODULE_PARM_DESC(request_trace, "Enable request tracing to DMINFO"); + +static int alloc_trace; +module_param(alloc_trace, bool, 0644); +MODULE_PARM_DESC(alloc_trace, "Enable allocation tracing to DMINFO"); +#else +#define VERITY_TRACE(...) +#endif + +#define REQTRACE(fmt, args...) VERITY_TRACE(request_trace, "req: " fmt, ## args) +#define ALLOCTRACE(fmt, args...) \ + VERITY_TRACE(alloc_trace, "alloc: " fmt, ## args) + +/* MIN_BIOS * 2 is a safe upper bound. An upper bound is desirable, especially + * with larger dm-bhts because multiple requests will be issued to bootstrap + * initial dm-bht root verification. + */ +static int max_bios = MIN_BIOS * 2; +module_param(max_bios, int, 0644); +MODULE_PARM_DESC(max_bios, "Max number of allocated BIOs"); + +/* Provide a lightweight means of controlling the behavior of dm-verity + * devices when the module is configured. + */ +enum error_behavior_type { DM_VERITY_EIO = 0, DM_VERITY_REBOOT, + DM_VERITY_NOTHING }; +static int error_behavior = DM_VERITY_EIO; +module_param(error_behavior, int, 0444); +MODULE_PARM_DESC(error_behavior, "Behavior on error"); + + +/* Control whether the reads are optimized for sequential access by pre-reading + * entries in the bht. + */ +static int bht_readahead; +module_param(bht_readahead, int, 0644); +MODULE_PARM_DESC(bht_readahead, "number of entries to readahead in the bht"); + +/* Used for tracking pending bios as well as for exporting information via + * STATUSTYPE_INFO. + */ +struct verity_stats { + unsigned int pending_bio; + unsigned int io_queue; + unsigned int verify_queue; + unsigned int average_requeues; + unsigned int total_requeues; + unsigned long long total_requests; +}; + +/* Holds the context of a given verify operation. */ +struct verify_context { + struct bio *bio; /* block_size padded bio or aligned original bio */ + unsigned int offset; /* into current bio_vec */ + unsigned int idx; /* of current bio_vec */ + unsigned int needed; /* for next hash */ + sector_t block; /* current block */ +}; + +/* per-requested-bio private data */ +enum next_queue_type { DM_VERITY_NONE, DM_VERITY_IO, DM_VERITY_VERIFY }; +struct dm_verity_io { + struct dm_target *target; + struct bio *base_bio; + struct delayed_work work; + unsigned int next_queue; + + struct verify_context ctx; + int error; + atomic_t pending; + + sector_t sector; /* unaligned, converted to target sector */ + sector_t block; /* aligned block index */ + sector_t count; /* aligned count in blocks */ + + /* Tracks the number of bv_pages that were allocated + * from the local pool during request padding. + */ + unsigned int leading_pages; + unsigned int trailing_pages; +}; + +struct verity_config { + struct dm_dev *dev; + sector_t start; + + struct dm_dev *hash_dev; + sector_t hash_start; + + struct dm_bht bht; + + /* Pool required for io contexts */ + mempool_t *io_pool; + /* Pool and bios required for making sure that backing device reads are + * in PAGE_SIZE increments. 
+ */ + mempool_t *page_pool; + struct bio_set *bs; + + /* Single threaded I/O submitter */ + struct workqueue_struct *io_queue; + /* Multithreaded verifier queue */ + struct workqueue_struct *verify_queue; + + char hash_alg[CRYPTO_MAX_ALG_NAME]; + struct hash_desc *hash; /* one per cpu */ + + struct verity_stats stats; +}; + +static struct kmem_cache *_verity_io_pool; + +static void kverityd_queue_verify(struct dm_verity_io *io); +static void kverityd_queue_io(struct dm_verity_io *io, bool delayed); +static void kverityd_io_bht_populate(struct dm_verity_io *io); +static void kverityd_io_bht_populate_end(struct bio *, int error); + + +/*----------------------------------------------- + * Statistic tracking functions + *-----------------------------------------------*/ + +void verity_stats_pending_bio_inc(struct verity_config *vc) +{ + vc->stats.pending_bio++; +} + +void verity_stats_pending_bio_dec(struct verity_config *vc) +{ + vc->stats.pending_bio--; +} + +void verity_stats_io_queue_inc(struct verity_config *vc) +{ + vc->stats.io_queue++; +} + +void verity_stats_verify_queue_inc(struct verity_config *vc) +{ + vc->stats.verify_queue++; +} + +void verity_stats_io_queue_dec(struct verity_config *vc) +{ + vc->stats.io_queue--; +} + +void verity_stats_verify_queue_dec(struct verity_config *vc) +{ + vc->stats.verify_queue--; +} + +void verity_stats_total_requeues_inc(struct verity_config *vc) +{ + if (vc->stats.total_requeues >= INT_MAX - 1) { + DMINFO("stats.total_requeues is wrapping"); + vc->stats.total_requeues = 0; + } + vc->stats.total_requeues++; +} + +void verity_stats_total_requests_inc(struct verity_config *vc) +{ + vc->stats.total_requests++; +} + +void verity_stats_average_requeues(struct verity_config *vc, int requeues) +{ + /* TODO(wad) */ +} + +/*----------------------------------------------- + * Allocation and utility functions + *-----------------------------------------------*/ + +static void verity_align_request(struct verity_config *vc, sector_t *start, + unsigned int *size); +static void kverityd_src_io_read_end(struct bio *clone, int error); + +/* Shared destructor for all internal bios */ +static void dm_verity_bio_destructor(struct bio *bio) +{ + struct dm_verity_io *io = bio->bi_private; + struct verity_config *vc = io->target->private; + verity_stats_pending_bio_dec(io->target->private); + bio_free(bio, vc->bs); +} + +/* This function may block if the number of outstanding requests is too high. */ +struct bio *verity_alloc_bioset(struct verity_config *vc, gfp_t gfp_mask, + int nr_iovecs) +{ + /* Don't flood the I/O or over allocate from the pools */ + verity_stats_pending_bio_inc(vc); + while (vc->stats.pending_bio > (unsigned int) max_bios) { + DMINFO("too many outstanding BIOs (%u). sleeping.", + vc->stats.pending_bio - 1); + /* The request may be for dev->bdev, but all additional requests + * come through the hash_dev and are the issue for clean up. 
+ */ + msleep_interruptible(10); + } + return bio_alloc_bioset(gfp_mask, nr_iovecs, vc->bs); +} + +static struct dm_verity_io *verity_io_alloc(struct dm_target *ti, + struct bio *bio, sector_t sector) +{ + struct verity_config *vc = ti->private; + struct dm_verity_io *io; + unsigned int aligned_size = bio->bi_size; + sector_t aligned_sector = sector; + + ALLOCTRACE("dm_verity_io for sector %llu", ULL(sector)); + io = mempool_alloc(vc->io_pool, GFP_NOIO); + if (unlikely(!io)) + return NULL; + /* By default, assume io requests will require a hash */ + io->next_queue = DM_VERITY_IO; + io->target = ti; + io->base_bio = bio; + io->sector = sector; + io->error = 0; + io->leading_pages = 0; + io->trailing_pages = 0; + io->ctx.bio = NULL; + + verity_align_request(vc, &aligned_sector, &aligned_size); + /* Adjust the sector by the virtual starting sector */ + io->block = (to_bytes(aligned_sector)) >> PAGE_SHIFT; + io->count = aligned_size >> PAGE_SHIFT; + + DMDEBUG("io_alloc for %llu blocks starting at %llu", + ULL(io->count), ULL(io->block)); + + atomic_set(&io->pending, 0); + + return io; +} + +/* ctx->bio should be prepopulated */ +static int verity_verify_context_init(struct verity_config *vc, + struct verify_context *ctx) +{ + if (unlikely(ctx->bio == NULL)) { + DMCRIT("padded bio was not supplied to kverityd"); + return -EINVAL; + } + ctx->offset = 0; + ctx->needed = PAGE_SIZE; + ctx->idx = ctx->bio ? ctx->bio->bi_idx : 0; + /* The sector has already be translated and adjusted to the + * appropriate start for reading. + */ + ctx->block = to_bytes(ctx->bio->bi_sector) >> PAGE_SHIFT; + /* Setup the new hash context too */ + if (crypto_hash_init(&vc->hash[smp_processor_id()])) { + DMCRIT("Failed to initialize the crypto hash"); + return -EFAULT; + } + return 0; +} + +static void clone_init(struct dm_verity_io *io, struct bio *clone, + unsigned short vcnt, unsigned int size, sector_t start) +{ + struct verity_config *vc = io->target->private; + + clone->bi_private = io; + clone->bi_end_io = kverityd_src_io_read_end; + clone->bi_bdev = vc->dev->bdev; + clone->bi_rw = io->base_bio->bi_rw; + clone->bi_destructor = dm_verity_bio_destructor; + clone->bi_idx = 0; + clone->bi_vcnt = vcnt; + clone->bi_size = size; + clone->bi_sector = start; +} + +static void verity_align_request(struct verity_config *vc, sector_t *start, + unsigned int *size) +{ + sector_t base_sector; + VERITY_BUG_ON(!start || !size || !vc, "NULL arguments"); + + base_sector = *start; + /* Mask off to the starting sector for a block */ + *start = base_sector & (~(to_sector(PAGE_SIZE) - 1)); + /* Add any extra bytes from the lead */ + *size += to_bytes(base_sector - *start); + /* Now round up the size to the full block size */ + *size = PAGE_ALIGN(*size); +} + +/* Populates a bio_vec array starting with the pointer provided allocating + * pages from the given page pool until bytes reaches 0. + * The next position in the bio_vec array is returned on success. On + * failure, a NULL is returned. + * It is assumed that the bio_vec array is properly sized. 
+ */ +static struct bio_vec *populate_bio_vec(struct bio_vec *bvec, + mempool_t *page_pool, + unsigned int bytes, + unsigned int *pages_added) { + gfp_t gfp_mask = GFP_NOIO | __GFP_HIGHMEM; + if (!bvec || !page_pool || !pages_added) + return NULL; + + while (bytes > 0) { + DMDEBUG("bytes == %u", bytes); + bvec->bv_offset = 0; + bvec->bv_len = min(bytes, (unsigned int)PAGE_SIZE); + ALLOCTRACE("page for bio_vec %p", bvec); + bvec->bv_page = mempool_alloc(page_pool, gfp_mask); + if (unlikely(bvec->bv_page == NULL)) { + DMERR("Out of pages in the page_pool!"); + return NULL; + } + bytes -= bvec->bv_len; + ++*pages_added; + ++bvec; + } + return bvec; +} + +static int verity_hash_bv(struct verity_config *vc, + struct verify_context *ctx) +{ + struct bio_vec *bv = bio_iovec_idx(ctx->bio, ctx->idx); + struct scatterlist sg; + unsigned int size = bv->bv_len - (bv->bv_offset + ctx->offset); + + /* Catch unexpected undersized bvs */ + if (bv->bv_len < bv->bv_offset + ctx->offset) { + ctx->offset = 0; + ctx->idx++; + return 0; + } + + /* Only update the hash with the amount that's needed for this + * block (as tracked in the ctx->sector). + */ + size = min(size, ctx->needed); + DMDEBUG("Updating hash for block %llu vector idx %u: " + "size:%u offset:%u+%u len:%u", + ULL(ctx->block), ctx->idx, size, bv->bv_offset, ctx->offset, + bv->bv_len); + + /* Updates the current digest hash context for the on going block */ + sg_init_table(&sg, 1); + sg_set_page(&sg, bv->bv_page, size, bv->bv_offset + ctx->offset); + + /* Use one hash_desc+tfm per cpu so that we can make use of all + * available cores when verifying. Only one context is handled per + * processor, however. + */ + if (crypto_hash_update(&vc->hash[smp_processor_id()], &sg, size)) { + DMCRIT("Failed to update crypto hash"); + return -EFAULT; + } + ctx->needed -= size; + ctx->offset += size; + + if (ctx->offset + bv->bv_offset >= bv->bv_len) { + ctx->offset = 0; + /* Bump the bio_segment counter */ + ctx->idx++; + } + + return 0; +} + +static void verity_free_buffer_pages(struct verity_config *vc, + struct dm_verity_io *io, struct bio *clone) +{ + unsigned int original_vecs = clone->bi_vcnt; + struct bio_vec *bv; + VERITY_BUG_ON(!vc || !io || !clone, "NULL arguments"); + + /* No work to do. */ + if (!io->leading_pages && !io->trailing_pages) + return; + + /* Determine which pages are ours with one page per vector. */ + original_vecs -= io->leading_pages + io->trailing_pages; + + /* Handle any leading pages */ + bv = bio_iovec_idx(clone, 0); + while (io->leading_pages--) { + if (unlikely(bv->bv_page == NULL)) { + VERITY_BUG("missing leading bv_page in padding"); + bv++; + continue; + } + mempool_free(bv->bv_page, vc->page_pool); + bv->bv_page = NULL; + bv++; + } + bv += original_vecs; + /* TODO(wad) This is probably off-by-one */ + while (io->trailing_pages--) { + if (unlikely(bv->bv_page == NULL)) { + VERITY_BUG("missing leading bv_page in padding"); + bv++; + continue; + } + mempool_free(bv->bv_page, vc->page_pool); + bv->bv_page = NULL; + bv++; + } +} + +static void kverityd_verify_cleanup(struct dm_verity_io *io, int error) +{ + struct verity_config *vc = io->target->private; + /* Clean up the pages used for padding, if any. */ + verity_free_buffer_pages(vc, io, io->ctx.bio); + + /* Release padded bio, if one was used. */ + if (io->ctx.bio != io->base_bio) { + bio_put(io->ctx.bio); + io->ctx.bio = NULL; + } + /* Propagate the verification error. */ + io->error = error; +} + +/* If the request is not successful, this handler takes action. 
+ * TODO make this call a registered handler. + */ +static void verity_error(struct verity_config *vc, struct dm_verity_io *io, + int error) { + if (io) + io->error = -EIO; + + switch (error) { + case -EACCES: + DMERR_LIMIT("verification failure occurred"); + if (error_behavior == DM_VERITY_REBOOT) { +#ifdef CONFIG_KEXEC + kernel_kexec(); +#else + /* Test this versus emergency_restart(); */ + kernel_restart("dm-verity"); +#endif + } else if (error_behavior == DM_VERITY_NOTHING) { + /* IO errors have to be propagated. */ + if (error != -EIO && io) + io->error = 0; + } + return; + default: + break; + } +} + +/*----------------------------------------------------------------- + * Reverse flow of requests into the device. + * + * (Start at the bottom with verity_map and work your way upward). + *-----------------------------------------------------------------*/ + +static void verity_inc_pending(struct dm_verity_io *io); + +/* verity_dec_pending manages the lifetime of all dm_verity_io structs. + * Non-bug error handling is centralized through this interface and + * all passage from workqueue to workqueue. + */ +static void verity_dec_pending(struct dm_verity_io *io) +{ + struct verity_config *vc = io->target->private; + struct bio *base_bio = io->base_bio; + int error = io->error; + VERITY_BUG_ON(!io, "NULL argument"); + + DMDEBUG("dec pending %p: %d--", io, atomic_read(&io->pending)); + + if (!atomic_dec_and_test(&io->pending)) + goto more_work_to_do; + + if (unlikely(error)) { + /* We treat bad I/O as a compromise so that we go + * to recovery mode. VERITY_BUG will just reboot on + * e.g., OOM errors. + */ + verity_error(vc, io, error); + /* let the handler change the error. */ + error = io->error; + goto return_to_user; + } + + if (io->next_queue == DM_VERITY_IO) { + kverityd_queue_io(io, true); + DMDEBUG("io %p requeued for io"); + goto more_work_to_do; + } + + if (io->next_queue == DM_VERITY_VERIFY) { + verity_stats_io_queue_dec(vc); + verity_stats_verify_queue_inc(vc); + kverityd_queue_verify(io); + DMDEBUG("io %p enqueued for verify"); + goto more_work_to_do; + } + +return_to_user: + /* else next_queue == DM_VERITY_NONE */ + mempool_free(io, vc->io_pool); + + /* Return back to the caller */ + bio_endio(base_bio, error); + +more_work_to_do: + return; +} + +/* Walks the, potentially padded, data set and computes the hash of the + * data read from the untrusted source device. The computed hash is + * then passed to dm-bht for verification. + */ +static int verity_verify(struct verity_config *vc, + struct verify_context *ctx) +{ + u8 digest[VERITY_MAX_DIGEST_SIZE]; + int r; + unsigned int block = 0; + unsigned int digest_size = + crypto_hash_digestsize(vc->hash[smp_processor_id()].tfm); + + while (ctx->idx < ctx->bio->bi_vcnt) { + r = verity_hash_bv(vc, ctx); + if (r < 0) + goto bad_hash; + + /* Continue until all the data expected is processed */ + if (ctx->needed) + continue; + + /* Calculate the final block digest to check in the tree */ + if (crypto_hash_final(&vc->hash[smp_processor_id()], digest)) { + DMCRIT("Failed to compute final digest"); + r = -EFAULT; + goto bad_state; + } + block = (unsigned int)(ctx->block); + r = dm_bht_verify_block(&vc->bht, block, digest, digest_size); + /* dm_bht functions aren't expected to return errno friendly + * values. They are converted here for uniformity. 
+ */ + if (r > 0) { + DMERR("Pending data for block %llu seen at verify", + ULL(ctx->block)); + r = -EBUSY; + goto bad_state; + } + if (r < 0) { + DMERR_LIMIT("Block hash does not match!"); + r = -EACCES; + goto bad_match; + } + REQTRACE("Block %llu verified", ULL(ctx->block)); + + /* Reset state for the next block */ + if (crypto_hash_init(&vc->hash[smp_processor_id()])) { + DMCRIT("Failed to initialize the crypto hash"); + r = -ENOMEM; + goto no_mem; + } + ctx->needed = PAGE_SIZE; + ctx->block++; + /* After completing a block, allow a reschedule. + * TODO(wad) determine if this is truly needed. + */ + cond_resched(); + } + + return 0; + +bad_hash: +no_mem: +bad_state: +bad_match: + return r; +} + +/* Services the verify workqueue */ +static void kverityd_verify(struct work_struct *work) +{ + struct delayed_work *dwork = container_of(work, struct delayed_work, + work); + struct dm_verity_io *io = container_of(dwork, struct dm_verity_io, + work); + struct verity_config *vc = io->target->private; + int r = 0; + + verity_inc_pending(io); + + /* Default to ctx.bio as the padded bio clone. + * The original bio is never touched until release. + */ + r = verity_verify_context_init(vc, &io->ctx); + + /* Only call verify if context initialization succeeded */ + if (!r) + r = verity_verify(vc, &io->ctx); + /* Free up the padded bio and tag with the return value */ + kverityd_verify_cleanup(io, r); + verity_stats_verify_queue_dec(vc); + + verity_dec_pending(io); +} + +/* After all I/O is completed successfully for a request, it is queued on the + * verify workqueue to ensure its integrity prior to returning it to the + * caller. There may be multiple workqueue threads - one per logical + * processor. + */ +static void kverityd_queue_verify(struct dm_verity_io *io) +{ + struct verity_config *vc = io->target->private; + /* verify work should never be requeued. */ + io->next_queue = DM_VERITY_NONE; + REQTRACE("Block %llu+ is being queued for verify (io:%p)", + ULL(io->block), io); + INIT_DELAYED_WORK(&io->work, kverityd_verify); + /* No delay needed. But if we move over jobs with pending io, then + * we could probably delay them here. + */ + queue_delayed_work(vc->verify_queue, &io->work, 0); +} + +/* Asynchronously called upon the completion of dm-bht I/O. The status + * of the operation is passed back to dm-bht and the next steps are + * decided by verity_dec_pending. + */ +static void kverityd_io_bht_populate_end(struct bio *bio, int error) +{ + struct dm_bht_entry *entry = (struct dm_bht_entry *) bio->bi_private; + struct dm_verity_io *io = (struct dm_verity_io *) entry->io_context; + + DMDEBUG("kverityd_io_bht_populate_end (io:%p, entry:%p)", io, entry); + /* Tell the tree to atomically update now that we've populated + * the given entry. + */ + dm_bht_read_completed(entry, error); + + /* Clean up for reuse when reading data to be checked */ + bio->bi_vcnt = 0; + bio->bi_io_vec->bv_offset = 0; + bio->bi_io_vec->bv_len = 0; + bio->bi_io_vec->bv_page = NULL; + /* Restore the private data to I/O so the destructor can be shared. */ + bio->bi_private = (void *) io; + bio_put(bio); + + /* We bail but assume the tree has been marked bad. 
+	 */
+	if (unlikely(error)) {
+		DMERR("Failed to read for block %llu (%u)",
+		      ULL(io->base_bio->bi_sector), io->base_bio->bi_size);
+		io->error = error;
+		/* Pass through the error to verity_dec_pending below */
+	}
+	/* When pending = 0, it will transition to reading real data */
+	verity_dec_pending(io);
+}
+
+/* Called by dm-bht (via dm_bht_populate), this function provides
+ * the message digests to dm-bht that are stored on disk.
+ */
+static int kverityd_bht_read_callback(void *ctx, sector_t start, u8 *dst,
+				      sector_t count,
+				      struct dm_bht_entry *entry)
+{
+	struct dm_verity_io *io = ctx;  /* I/O for this batch */
+	struct verity_config *vc;
+	struct bio *bio;
+	/* Explicitly catches these so we can use a custom bug route */
+	VERITY_BUG_ON(!io || !dst || !io->target || !io->target->private);
+	VERITY_BUG_ON(!entry);
+	VERITY_BUG_ON(count != to_sector(PAGE_SIZE));
+
+	vc = io->target->private;
+
+	/* The I/O context is nested inside the entry so that we don't need one
+	 * io context per page read.
+	 */
+	entry->io_context = ctx;
+
+	/* We should only get page size requests at present. */
+	verity_inc_pending(io);
+	bio = verity_alloc_bioset(vc, GFP_NOIO, 1);
+	if (unlikely(!bio)) {
+		DMCRIT("Out of memory at bio_alloc_bioset");
+		dm_bht_read_completed(entry, -ENOMEM);
+		return -ENOMEM;
+	}
+	bio->bi_private = (void *) entry;
+	bio->bi_idx = 0;
+	bio->bi_size = PAGE_SIZE;
+	bio->bi_sector = vc->hash_start + start;
+	bio->bi_bdev = vc->hash_dev->bdev;
+	bio->bi_end_io = kverityd_io_bht_populate_end;
+	/* Instead of using NOIDLE, we unplug on intervals */
+	bio->bi_rw = (1 << BIO_RW_META);
+	/* Only need to free the bio since the page is managed by bht */
+	bio->bi_destructor = dm_verity_bio_destructor;
+	bio->bi_vcnt = 1;
+	bio->bi_io_vec->bv_offset = 0;
+	bio->bi_io_vec->bv_len = to_bytes(count);
+	/* dst is guaranteed to be a page_pool allocation */
+	bio->bi_io_vec->bv_page = virt_to_page(dst);
+	/* Track that this I/O is in use.  There should be no risk of the io
+	 * being removed prior since this is called synchronously.
+	 */
+	DMDEBUG("Submitting bht io %p (entry:%p)", io, entry);
+	generic_make_request(bio);
+	return 0;
+}
+
+/* Performs the work of loading in any missing bht hashes. */
+static void kverityd_io_bht_populate(struct dm_verity_io *io)
+{
+	struct verity_config *vc = io->target->private;
+	struct request_queue *r_queue = bdev_get_queue(vc->hash_dev->bdev);
+	int populated = 0;
+	int io_status = 0;
+	sector_t count;
+
+	verity_inc_pending(io);
+	/* Submits an io request for each missing block of block hashes.
+	 * The last one to return will then enqueue this on the
+	 * io workqueue.
+	 */
+	REQTRACE("populating %llu starting at block %llu (io:%p)",
+		 ULL(io->count), ULL(io->block), io);
+	for (count = 0; count < io->count; ++count) {
+		unsigned int block = (unsigned int)(io->block + count);
+		/* Check for truncation. */
+		VERITY_BUG_ON((sector_t)(block) < io->block);
+		/* populate for each block */
+		DMDEBUG("Calling dm_bht_populate for %u (io:%p)", block, io);
+		populated = dm_bht_populate(&vc->bht, io, block);
+		if (populated < 0) {
+			DMCRIT("dm_bht_populate error for block %u (io:%p): %d",
+			       block, io, populated);
+			/* verity_dec_pending will handle the error case.
+			 */
+			io->error = -EPERM;
+			break;
+		}
+		/* Accrue view of all I/O state for the full request */
+		io_status |= populated;
+
+		/* If this io has outstanding requests, unplug the io queue */
+		if (populated & DM_BHT_ENTRY_REQUESTED) {
+			blk_unplug(r_queue);
+			/* If the bht is reading ahead, a cond_resched may be
+			 * needed to break contention. msleep_interruptible(1)
+			 * is an alternative option.
+			 */
+			if (bht_readahead)
+				cond_resched();
+		}
+	}
+	REQTRACE("Block %llu+ initiated %d requests (io: %p)",
+		 ULL(io->block), atomic_read(&io->pending) - 1, io);
+
+	/* TODO(wad) clean up the return values from maybe_read_entries */
+	/* If we return verified explicitly, then later we could do IO_REMAP
+	 * instead of resending to the verify queue.
+	 */
+	io->next_queue = DM_VERITY_VERIFY;
+	if (io_status == 0 || io_status == DM_BHT_ENTRY_READY) {
+		/* The whole request is ready. Make sure we can transition. */
+		DMDEBUG("io ready to be bumped %p", io);
+	}
+
+	if (io_status & DM_BHT_ENTRY_REQUESTED) {
+		/* If no data is pending another I/O request, this io
+		 * will get bounced on the next queue when the last async call
+		 * returns.
+		 */
+		DMDEBUG("io has outstanding requests %p", io);
+	}
+
+	/* Some I/O is pending outside of this request. */
+	if (io_status & DM_BHT_ENTRY_PENDING) {
+		/* PENDING will cause a BHT requeue as delayed work */
+		/* TODO(wad) add a callback to dm-bht for pending_cb. Then for
+		 * each entry, the io could be delayed heavily until the end
+		 * read cb requeues them. (Entries could be added to the
+		 * stored I/O context but races could be a challenge.)
+		 */
+		DMDEBUG("io is pending %p", io);
+		io->next_queue = DM_VERITY_IO;
+	}
+
+	/* If I/O requests were issued on behalf of populate, then the last
+	 * request will result in a requeue. If all data was pending from
+	 * other requests, this will be requeued now.
+	 */
+	verity_dec_pending(io);
+}
+
+/* Asynchronously called upon the completion of I/O issued
+ * from kverityd_src_io_read. verity_dec_pending() acts as
+ * the scheduler/flow manager.
+ */
+static void kverityd_src_io_read_end(struct bio *clone, int error)
+{
+	struct dm_verity_io *io = clone->bi_private;
+	struct verity_config *vc = io->target->private;
+
+	DMDEBUG("Padded I/O completed");
+	if (unlikely(!bio_flagged(clone, BIO_UPTODATE) && !error))
+		error = -EIO;
+
+	if (unlikely(error)) {
+		DMERR("Error occurred: %d (%llu, %u)",
+		      error, ULL(clone->bi_sector), clone->bi_size);
+		io->error = error;
+		/* verity_dec_pending will pick up the error, but it won't
+		 * free the leading/trailing pages for us.
+		 */
+		verity_free_buffer_pages(vc, io, io->ctx.bio);
+		/* drop the padded bio since we'll never use it now. */
+		bio_put(io->ctx.bio);
+	}
+
+	/* Release the clone, which just keeps the block layer from
+	 * leaving offsets, etc. in unexpected states.
+	 */
+	bio_put(clone);
+
+	verity_dec_pending(io);
+	DMDEBUG("all data has been loaded from the data device");
+}
+
+/* If not yet underway, an I/O request will be issued to the vc->dev
+ * device for the data needed. It is padded to the minimum block
+ * size, aligned to that size, and cloned to avoid unexpected changes
+ * to the original bio struct.
+ */ +static void kverityd_src_io_read(struct dm_verity_io *io) +{ + struct verity_config *vc = io->target->private; + struct bio *base_bio = io->base_bio; + sector_t bio_start = vc->start + io->sector; + unsigned int leading_bytes = 0; + unsigned int trailing_bytes = 0; + unsigned int bio_size = base_bio->bi_size; + unsigned int vecs_needed = bio_segments(base_bio); + struct bio_vec *cur_bvec = NULL; + struct bio *clone = NULL; + + VERITY_BUG_ON(!io); + + /* If bio is non-NULL, then the data is already read. Could also check + * BIO_UPTODATE, but it doesn't seem needed. + */ + if (io->ctx.bio) { + DMDEBUG("io_read called with existing bio. bailing: %p", io); + return; + } + DMDEBUG("kverity_io_read started"); + + verity_inc_pending(io); + /* We duplicate the bio here for two reasons: + * 1. The block layer may modify the bvec array + * 2. We may need to pad to BLOCK_SIZE + * First, we have to determine if we need more segments than are + * currently in use. + */ + verity_align_request(vc, &bio_start, &bio_size); + /* Number of bytes alignment added to the start */ + leading_bytes = to_bytes((vc->start + io->sector) - bio_start); + /* ... to the end of the original bio. */ + trailing_bytes = (bio_size - leading_bytes) - base_bio->bi_size; + + /* Additions are page aligned so page sized vectors can be padded in */ + vecs_needed += PAGE_ALIGN(leading_bytes) >> PAGE_SHIFT; + vecs_needed += PAGE_ALIGN(trailing_bytes) >> PAGE_SHIFT; + + DMDEBUG("allocating bioset for padding (%u)", vecs_needed); + /* Allocate the bioset for the duplicate plus padding */ + if (vecs_needed == bio_segments(base_bio)) { + /* No padding is needed so we can just verify using the + * original bio. + */ + DMDEBUG("No padding needed!"); + io->ctx.bio = base_bio; + } else { + ALLOCTRACE("bioset for io %p, sector %llu", + io, ULL(bio_start)); + io->ctx.bio = verity_alloc_bioset(vc, GFP_NOIO, vecs_needed); + if (unlikely(io->ctx.bio == NULL)) { + DMCRIT("Failed to allocate padded bioset"); + io->error = -ENOMEM; + verity_dec_pending(io); + return; + } + + DMDEBUG("clone init"); + clone_init(io, io->ctx.bio, vecs_needed, bio_size, bio_start); + /* Now we need to copy over the iovecs, but we need to make + * sure to offset if we realigned the request. 
+ */ + cur_bvec = io->ctx.bio->bi_io_vec; + + DMDEBUG("Populating padded bioset (%u %u)", + leading_bytes, trailing_bytes); + DMDEBUG("leading_bytes == %u", leading_bytes); + cur_bvec = populate_bio_vec(cur_bvec, vc->page_pool, + leading_bytes, &io->leading_pages); + if (unlikely(cur_bvec == NULL)) { + io->error = -ENOMEM; + verity_free_buffer_pages(vc, io, io->ctx.bio); + bio_put(io->ctx.bio); + verity_dec_pending(io); + return; + } + /* Now we should be aligned to copy the bio_vecs in place */ + DMDEBUG("copying original bvecs"); + memcpy(cur_bvec, bio_iovec(base_bio), + sizeof(struct bio_vec) * bio_segments(base_bio)); + cur_bvec += bio_segments(base_bio); + /* Handle trailing vecs */ + DMDEBUG("trailing_bytes == %u", trailing_bytes); + cur_bvec = populate_bio_vec(cur_bvec, vc->page_pool, + trailing_bytes, + &io->trailing_pages); + if (unlikely(cur_bvec == NULL)) { + io->error = -ENOMEM; + verity_free_buffer_pages(vc, io, io->ctx.bio); + bio_put(io->ctx.bio); + verity_dec_pending(io); + return; + } + } + + /* Now clone the padded request */ + DMDEBUG("Creating clone of the padded request"); + ALLOCTRACE("clone for io %p, sector %llu", + io, ULL(bio_start)); + clone = verity_alloc_bioset(vc, GFP_NOIO, bio_segments(io->ctx.bio)); + if (unlikely(!clone)) { + io->error = -ENOMEM; + /* Clean up */ + verity_free_buffer_pages(vc, io, io->ctx.bio); + if (io->ctx.bio != base_bio) + bio_put(io->ctx.bio); + verity_dec_pending(io); + return; + } + + clone_init(io, clone, bio_segments(io->ctx.bio), io->ctx.bio->bi_size, + bio_start); + DMDEBUG("Populating clone of the padded request"); + memcpy(clone->bi_io_vec, bio_iovec(io->ctx.bio), + sizeof(struct bio_vec) * clone->bi_vcnt); + + /* Submit to the block device */ + DMDEBUG("Submitting padded bio (%u became %u)", + bio_sectors(base_bio), bio_sectors(clone)); + /* XXX: check queue_max_hw_sectors(bdev_get_queue(clone->bi_bdev)); */ + generic_make_request(clone); +} + +/* kverityd_io services the I/O workqueue. For each pass through + * the I/O workqueue, a call to populate both the origin drive + * data and the hash tree data is made. + */ +static void kverityd_io(struct work_struct *work) +{ + struct delayed_work *dwork = container_of(work, struct delayed_work, + work); + struct dm_verity_io *io = container_of(dwork, struct dm_verity_io, + work); + VERITY_BUG_ON(!io->base_bio); + + if (bio_data_dir(io->base_bio) == WRITE) { + /* TODO(wad) do something smarter when writes are seen */ + printk(KERN_WARNING + "Unexpected write bio received in kverityd_io"); + io->error = -EIO; + return; + } + + /* Issue requests asynchronously. */ + verity_inc_pending(io); + kverityd_src_io_read(io); /* never updates next_queue */ + kverityd_io_bht_populate(io); /* manage next_queue eligibility */ + verity_dec_pending(io); +} + +/* All incoming requests are queued on the I/O workqueue at least once to + * acquire both the data from the real device (vc->dev) and any data from the + * hash tree device (vc->hash_dev) needed to verify the integrity of the data. + * There may be multiple I/O workqueues - one per logical processor. + */ +static void kverityd_queue_io(struct dm_verity_io *io, bool delayed) +{ + struct verity_config *vc = io->target->private; + unsigned long delay = 0; /* jiffies */ + /* Send all requests through one call to dm_bht_populate on the + * queue to ensure that all blocks are accounted for before + * proceeding on to verification. 
+	 */
+	INIT_DELAYED_WORK(&io->work, kverityd_io);
+	/* If this context is dependent on work from another context, we just
+	 * requeue with a delay. Later we could bounce this work to the verify
+	 * queue and have it wait there. TODO(wad)
+	 */
+	if (delayed) {
+		delay = HZ / 10;
+		REQTRACE("block %llu+ is being delayed %lu jiffies (io:%p)",
+			 ULL(io->block), delay, io);
+	}
+	queue_delayed_work(vc->io_queue, &io->work, delay);
+}
+
+/* Paired with verity_dec_pending, the pending value in the io dictates the
+ * lifetime of a request and when it is ready to be processed on the
+ * workqueues.
+ */
+static void verity_inc_pending(struct dm_verity_io *io)
+{
+	atomic_inc(&io->pending);
+}
+
+/* Block-level requests start here. */
+static int verity_map(struct dm_target *ti, struct bio *bio,
+		      union map_info *map_context) {
+	struct dm_verity_io *io;
+	struct verity_config *vc;
+	struct request_queue *r_queue;
+
+	if (unlikely(!ti)) {
+		DMERR("dm_target was NULL");
+		return -EIO;
+	}
+
+	vc = ti->private;
+	r_queue = bdev_get_queue(vc->dev->bdev);
+
+	/* Barriers are passed through. Since the device is read-only,
+	 * barrier use seems unlikely but being data-free shouldn't be blocked
+	 * here.
+	 */
+	if (unlikely(bio_empty_barrier(bio))) {
+		bio->bi_bdev = vc->dev->bdev;
+		return DM_MAPIO_REMAPPED;
+	}
+
+	/* Trace incoming bios */
+	REQTRACE("Got a %s for %llu, %u bytes",
+		 (bio_rw(bio) == WRITE ? "WRITE" :
+		  (bio_rw(bio) == READ ? "READ" : "READA")),
+		 ULL(bio->bi_sector), bio->bi_size);
+
+	verity_stats_total_requests_inc(vc);
+
+	if (bio_data_dir(bio) == WRITE) {
+		/* If we silently drop writes, then the VFS layer will cache
+		 * the write and persist it in memory. While it doesn't change
+		 * the underlying storage, it still may be contrary to the
+		 * behavior expected by a verified, read-only device.
+		 */
+		DMWARN_LIMIT("write request received. rejecting with -EIO.");
+		verity_error(vc, NULL, -EIO);
+		/* bio_endio(bio, -EIO); */
+		return -EIO;
+	} else {
+		/* Queue up the request to be verified */
+		io = verity_io_alloc(ti, bio, bio->bi_sector - ti->begin);
+		if (!io) {
+			DMERR_LIMIT("Failed to allocate and init IO data");
+			return DM_MAPIO_REQUEUE;
+		}
+		verity_stats_io_queue_inc(vc);
+		kverityd_queue_io(io, false);
+	}
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+/*
+ * Non-block interfaces and device-mapper specific code
+ */
+
+/*
+ * Construct a verified mapping:
+ *   <device_to_verify> <device_with_hashes> <hash_start>
+ *   <tree_depth> <hash_alg> <root_hexdigest>
+ *
+ * E.g.,
+ *  /dev/sda2 /dev/sda3 0 2 sha256
+ *  f08aa4a3695290c569eb1b0ac032ae1040150afb527abbeb0a3da33d82fb2c6e
+ *
+ * TODO(wad):
+ * - Add stats: num_requeues, num_ios, etc with proc interface
+ * - Boot time addition
+ * - Track block verification to free block_hashes if memory use is a concern
+ * Testing needed:
+ * - Regular slub_debug tracing (on checkins)
+ * - Improper block hash padding
+ * - Improper bundle padding
+ * - Improper hash layout
+ * - Missing padding at end of device
+ * - Improperly sized underlying devices
+ * - Out of memory conditions (make sure this isn't too flaky under high load!)
+ * - Incorrect superhash + * - Incorrect block hashes + * - Incorrect bundle hashes + * - Boot-up read speed; sustained read speeds + */ +static int verity_ctr(struct dm_target *ti, unsigned int argc, char **argv) +{ + struct verity_config *vc; + int cpu = 0; + int ret = 0; + int depth; + unsigned long long tmpull = 0; + sector_t blocks; + + if (argc != 6) { + ti->error = "Not enough arguments supplied"; + return -EINVAL; + } + + /* The device mapper device should be setup read-only */ + if ((dm_table_get_mode(ti->table) & ~FMODE_READ) != 0) { + ti->error = "Must be created readonly."; + return -EINVAL; + } + + ALLOCTRACE("verity_config"); + vc = kzalloc(sizeof(*vc), GFP_KERNEL); + if (!vc) { + /* TODO(wad) if this is called from the setup helper, then we + * catch these errors and do a CrOS specific thing. if not, we + * need to have this call the error handler. + */ + return -EINVAL; + } + + + /* arg3: blocks in a bundle */ + if (sscanf(argv[3], "%u", &depth) != 1 || + depth <= 0) { + ti->error = + "Zero or negative depth supplied"; + goto bad_depth; + } + /* dm-bht block size is HARD CODED to PAGE_SIZE right now. */ + /* Calculate the blocks from the given device size */ + blocks = to_bytes(ti->len) >> PAGE_SHIFT; + if (dm_bht_create(&vc->bht, (unsigned int)depth, blocks, argv[4])) { + DMERR("failed to create required bht"); + goto bad_bht; + } + if (dm_bht_set_root_hexdigest(&vc->bht, argv[5])) { + DMERR("root hexdigest error"); + goto bad_root_hexdigest; + } + dm_bht_set_read_cb(&vc->bht, kverityd_bht_read_callback); + dm_bht_set_entry_readahead(&vc->bht, bht_readahead); + + /* arg0: device to verify */ + vc->start = 0; /* TODO: should this support a starting offset? */ + /* We only ever grab the device in read-only mode. */ + ret = dm_get_device(ti, argv[0], vc->start, ti->len, + dm_table_get_mode(ti->table), &vc->dev); + if (ret) { + DMERR("Failed to acquire device '%s': %d", argv[0], ret); + ti->error = "Device lookup failed"; + goto bad_verity_dev; + } + + if (PAGE_ALIGN(vc->start) != vc->start || + PAGE_ALIGN(to_bytes(ti->len)) != to_bytes(ti->len)) { + ti->error = "Device must be PAGE_SIZE divisble/aligned"; + goto bad_hash_start; + } + + /* arg2: offset to hash on the hash device */ + if (sscanf(argv[2], "%llu", &tmpull) != 1) { + ti->error = + "Invalid hash_start supplied"; + goto bad_hash_start; + } + vc->hash_start = (sector_t)(tmpull); + + /* arg1: device with hashes. + * Note, arg1 == arg0 is okay as long as the size of + * ti->len passed to device mapper does not include + * the hashes. + */ + if (dm_get_device(ti, argv[1], vc->hash_start, dm_bht_sectors(&vc->bht), + dm_table_get_mode(ti->table), &vc->hash_dev)) { + ti->error = "Hash device lookup failed"; + goto bad_hash_dev; + } + + /* We leave the validity on the hash device open until the + * next arg. Then we go ahead and try to read in all the bundle + * hashes which live after the block hashes. If it fails, then + * the hash offset was wrong. + */ + + + /* arg4: cryptographic digest algorithm */ + if (snprintf(vc->hash_alg, CRYPTO_MAX_ALG_NAME, "%s", argv[4]) >= + CRYPTO_MAX_ALG_NAME) { + ti->error = "Hash algorithm name is too long"; + goto bad_hash; + } + /* Allocate enough crypto contexts to be able to perform verifies + * on all available CPUs. + */ + ALLOCTRACE("hash_desc array"); + vc->hash = (struct hash_desc *) + kcalloc(nr_cpu_ids, sizeof(struct hash_desc), GFP_KERNEL); + if (!vc->hash) { + DMERR("failed to allocate crypto hash contexts"); + return -ENOMEM; + } + + /* Setup the hash first. 
Its length determines much of the bht layout */ + for (cpu = 0; cpu < nr_cpu_ids; ++cpu) { + ALLOCTRACE("hash_tfm (per-cpu)"); + vc->hash[cpu].tfm = crypto_alloc_hash(vc->hash_alg, 0, 0); + if (IS_ERR(vc->hash[cpu].tfm)) { + DMERR("failed to allocate crypto hash '%s'", + vc->hash_alg); + vc->hash[cpu].tfm = NULL; + goto bad_hash_alg; + } + } + + /* TODO: Maybe issues a request on the io queue for block 0? */ + + /* Argument processing is done, setup operational data */ + /* Pool for dm_verity_io objects */ + ALLOCTRACE("slab pool for io objects"); + vc->io_pool = mempool_create_slab_pool(MIN_IOS, _verity_io_pool); + if (!vc->io_pool) { + ti->error = "Cannot allocate verity io mempool"; + goto bad_slab_pool; + } + + /* Used to allocate pages for padding requests to PAGE_SIZE */ + ALLOCTRACE("page pool for request padding"); + vc->page_pool = mempool_create_page_pool(MIN_POOL_PAGES, 0); + if (!vc->page_pool) { + ti->error = "Cannot allocate page mempool"; + goto bad_page_pool; + } + + /* Allocate the bioset used for request padding */ + /* TODO(wad) allocate a separate bioset for the first verify maybe */ + ALLOCTRACE("bioset for I/O reqs"); + vc->bs = bioset_create(MIN_BIOS, 0); + if (!vc->bs) { + ti->error = "Cannot allocate verity bioset"; + goto bad_bs; + } + + /* Only one thread for the workqueue to keep the memory allocation + * sane. Requests will be submitted asynchronously. blk_unplug() will + * be called at the end of each dm_populate call so that the async + * requests are batched per workqueue job. + */ + vc->io_queue = create_workqueue("kverityd_io"); + if (!vc->io_queue) { + ti->error = "Couldn't create kverityd io queue"; + goto bad_io_queue; + } + + vc->verify_queue = create_workqueue("kverityd"); + if (!vc->verify_queue) { + ti->error = "Couldn't create kverityd queue"; + goto bad_verify_queue; + } + + ti->num_flush_requests = 1; + ti->private = vc; + + /* TODO(wad) add device and hash device names */ + { + char hashdev[BDEVNAME_SIZE], vdev[BDEVNAME_SIZE]; + bdevname(vc->hash_dev->bdev, hashdev); + bdevname(vc->dev->bdev, vdev); + DMINFO("dev:%s hash:%s [sectors:%llu blocks:%llu]", vdev, + hashdev, ULL(dm_bht_sectors(&vc->bht)), ULL(blocks)); + } + return 0; + +bad_verify_queue: + destroy_workqueue(vc->io_queue); +bad_io_queue: + bioset_free(vc->bs); +bad_bs: + mempool_destroy(vc->page_pool); +bad_page_pool: + mempool_destroy(vc->io_pool); +bad_slab_pool: + for (cpu = 0; cpu < nr_cpu_ids; ++cpu) + if (vc->hash[cpu].tfm) + crypto_free_hash(vc->hash[cpu].tfm); +bad_hash_alg: +bad_hash: + kfree(vc->hash); + dm_put_device(ti, vc->hash_dev); +bad_hash_dev: +bad_hash_start: + dm_put_device(ti, vc->dev); +bad_depth: +bad_bht: +bad_root_hexdigest: +bad_verity_dev: + kfree(vc); /* hash is not secret so no need to zero */ + return -EINVAL; +} + +static void verity_dtr(struct dm_target *ti) +{ + struct verity_config *vc = (struct verity_config *) ti->private; + int cpu; + + DMDEBUG("Destroying io_queue"); + destroy_workqueue(vc->io_queue); + DMDEBUG("Destroying verify_queue"); + destroy_workqueue(vc->verify_queue); + + DMDEBUG("Destroying bs"); + bioset_free(vc->bs); + DMDEBUG("Destroying page_pool"); + mempool_destroy(vc->page_pool); + DMDEBUG("Destroying io_pool"); + mempool_destroy(vc->io_pool); + DMDEBUG("Destroying crypto hash"); + for (cpu = 0; cpu < nr_cpu_ids; ++cpu) + if (vc->hash[cpu].tfm) + crypto_free_hash(vc->hash[cpu].tfm); + kfree(vc->hash); + + DMDEBUG("Destroying block hash tree"); + dm_bht_destroy(&vc->bht); + + DMDEBUG("Putting hash_dev"); + dm_put_device(ti, 
vc->hash_dev); + + DMDEBUG("Putting dev"); + dm_put_device(ti, vc->dev); + DMDEBUG("Destroying config"); + kfree(vc); +} + +static int verity_status(struct dm_target *ti, status_type_t type, + char *result, unsigned int maxlen) { + struct verity_config *vc = (struct verity_config *) ti->private; + unsigned int sz = 0; + char hashdev[BDEVNAME_SIZE], vdev[BDEVNAME_SIZE]; + u8 hexdigest[VERITY_MAX_DIGEST_SIZE * 2 + 1] = { 0 }; + + dm_bht_root_hexdigest(&vc->bht, hexdigest, sizeof(hexdigest)); + + switch (type) { + case STATUSTYPE_INFO: + DMEMIT("%u %u %u %u %llu", + vc->stats.io_queue, + vc->stats.verify_queue, + vc->stats.average_requeues, + vc->stats.total_requeues, + vc->stats.total_requests); + break; + + case STATUSTYPE_TABLE: + bdevname(vc->hash_dev->bdev, hashdev); + bdevname(vc->dev->bdev, vdev); + DMEMIT("/dev/%s /dev/%s %llu %u %s %s", + vdev, + hashdev, + ULL(vc->hash_start), + vc->bht.depth, + vc->hash_alg, + hexdigest); + break; + } + return 0; +} + +static int verity_merge(struct dm_target *ti, struct bvec_merge_data *bvm, + struct bio_vec *biovec, int max_size) +{ + struct verity_config *vc = ti->private; + struct request_queue *q = bdev_get_queue(vc->dev->bdev); + + if (!q->merge_bvec_fn) + return max_size; + + bvm->bi_bdev = vc->dev->bdev; + bvm->bi_sector = vc->start + bvm->bi_sector - ti->begin; + + /* Optionally, this could just return 0 to stick to single pages. */ + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); +} + +static int verity_iterate_devices(struct dm_target *ti, + iterate_devices_callout_fn fn, void *data) +{ + struct verity_config *vc = ti->private; + + return fn(ti, vc->dev, vc->start, ti->len, data); +} + +static void verity_io_hints(struct dm_target *ti, + struct queue_limits *limits) +{ + /* + * Set this to the vc->dev value: + blk_limits_io_min(limits, chunk_size); */ + blk_limits_io_opt(limits, PAGE_SIZE); +} + +static struct target_type verity_target = { + .name = "verity", + .version = {0, 1, 0}, + .module = THIS_MODULE, + .ctr = verity_ctr, + .dtr = verity_dtr, + .map = verity_map, + .merge = verity_merge, + .status = verity_status, + .iterate_devices = verity_iterate_devices, + .io_hints = verity_io_hints, +}; + +static int __init dm_verity_init(void) +{ + int r; + + _verity_io_pool = KMEM_CACHE(dm_verity_io, 0); + if (!_verity_io_pool) + return -ENOMEM; + + r = dm_register_target(&verity_target); + if (r < 0) { + DMERR("register failed %d", r); + kmem_cache_destroy(_verity_io_pool); + /* TODO(wad): add optional recovery bail here. */ + } else { + DMINFO("dm-verity registered"); + /* TODO(wad): Add root setup to initcalls workqueue here */ + } + return r; +} + +static void __exit dm_verity_exit(void) +{ + dm_unregister_target(&verity_target); + kmem_cache_destroy(_verity_io_pool); +} + +module_init(dm_verity_init); +module_exit(dm_verity_exit); + +MODULE_AUTHOR("The Chromium OS Authors "); +MODULE_DESCRIPTION(DM_NAME " target for transparent disk integrity checking"); +MODULE_LICENSE("GPL"); diff --git a/include/linux/dm-bht.h b/include/linux/dm-bht.h new file mode 100644 index 000000000000..8d6097ef77c6 --- /dev/null +++ b/include/linux/dm-bht.h @@ -0,0 +1,153 @@ +/* + * Copyright (C) 2010 The Chromium OS Authors + * + * Device-Mapper block hash tree interface. + * See Documentation/device-mapper/dm-bht.txt for details. + * + * This file is released under the GPLv2. 
+ */ +#ifndef __LINUX_DM_BHT_H +#define __LINUX_DM_BHT_H + +#include +#include +#include +#include + +/* To avoid allocating memory for digest tests, we just setup a + * max to use for now. + */ +#define DM_BHT_MAX_DIGEST_SIZE 128 /* 1k hashes are unlikely for now */ +#define DM_BHT_MAX_NODE_COUNT 256 + +/* UNALLOCATED, PENDING, READY, and VERIFIED are valid states. All other + * values are entry-related return codes. + */ +#define DM_BHT_ENTRY_VERIFIED 8 /* 'nodes' has been checked against parent */ +#define DM_BHT_ENTRY_READY 4 /* 'nodes' is loaded and available */ +#define DM_BHT_ENTRY_PENDING 2 /* 'nodes' is being loaded */ +#define DM_BHT_ENTRY_REQUESTED 1 /* non-state response indicating entry is + * pending because of the current call + */ +#define DM_BHT_ENTRY_UNALLOCATED 0 /* untouched */ +#define DM_BHT_ENTRY_ERROR -1 /* entry is unsuitable for use */ +#define DM_BHT_ENTRY_ERROR_IO -2 /* I/O error on load */ + +/* Additional possible return codes */ +#define DM_BHT_ENTRY_ERROR_MISMATCH -3 /* Digest mismatch */ + +/* dm_bht_entry + * Contains dm_bht->node_count tree nodes at a given tree depth. + * state is used to transactionally assure that data is paged in + * from disk. Unless dm_bht kept running crypto contexts for each + * level, we need to load in the data for on-demand verification. + */ +struct dm_bht_entry { + atomic_t state; /* see defines */ + /* Keeping an extra pointer per entry wastes up to ~33k of + * memory if a 1m blocks are used (or 66 on 64-bit arch) + */ + void *io_context; /* Reserve a pointer for use during io */ + /* data should only be non-NULL if fully populated. */ + u8 *nodes; /* The hash data used to verify the children. + * Guaranteed to be page-aligned. + */ +}; + +/* dm_bht_level + * Contains an array of entries which represent a page of hashes where + * each hash is a node in the tree at the given tree depth/level. + */ +struct dm_bht_level { + struct dm_bht_entry *entries; /* array of entries of tree nodes */ + unsigned int count; /* number of entries at this level */ + sector_t sector; /* starting sector for this level */ +}; + +/* opaque context, start, databuf, sector_count */ +typedef int(*dm_bht_callback)(void *, /* external context */ + sector_t, /* start sector */ + u8 *, /* destination page */ + sector_t, /* num sectors */ + struct dm_bht_entry *); +/* dm_bht - Device mapper block hash tree + * dm_bht provides a fixed interface for comparing data blocks + * against a cryptographic hashes stored in a hash tree. It + * optimizes the tree structure for storage on disk. + * + * The tree is built from the bottom up. A collection of data, + * external to the tree, is hashed and these hashes are stored + * as the blocks in the tree. For some number of these hashes, + * a parent node is created by hashing them. These steps are + * repeated. + * + * TODO(wad): All hash storage memory is pre-allocated and freed once an + * entire branch has been verified. + */ +enum verify_mode { DM_BHT_REVERIFY_LEAVES = 0, DM_BHT_FULL_REVERIFY }; +struct dm_bht { + /* Configured values */ + /* ENFORCE: depth must be >= 2. */ + unsigned int depth; /* Depth of the tree including the root */ + unsigned int block_count; /* Number of blocks hashed */ + char hash_alg[CRYPTO_MAX_ALG_NAME]; + int verify_mode; /* different verification modes */ + unsigned int entry_readahead; /* number of entries to attempt to + * pre-read in a level. 
+ */ + + /* Computed values */ + unsigned int node_count; /* Data size (in hashes) for each entry */ + unsigned int node_count_shift; /* first bit set - 1 */ + /* There is one per CPU so that verified can be simultaneous. */ + struct hash_desc *hash_desc; /* Container for the hash alg */ + unsigned int digest_size; + sector_t sectors; /* Number of disk sectors used */ + + /* bool verified; Full tree is verified */ + u8 *root_digest; /* hash_alg(levels[0].entries[*].nodes) */ + atomic_t root_state; /* Uses UNALLOCATED, REQUESTED, and VERIFIED */ + struct dm_bht_level *levels; /* in reverse order */ + mempool_t *entry_pool; + /* Callbacks for reading and/or writing to the hash device */ + dm_bht_callback read_cb; + dm_bht_callback write_cb; +}; + +/* Constructor for struct dm_bht instances. */ +int dm_bht_create(struct dm_bht *bht, + unsigned int depth, + unsigned int block_count, + const char *alg_name); +/* Destructor for struct dm_bht instances. Does not free @bht */ +int dm_bht_destroy(struct dm_bht *bht); + +/* Basic accessors for struct dm_bht */ +sector_t dm_bht_sectors(const struct dm_bht *bht); +void dm_bht_set_entry_readahead(struct dm_bht *bht, + unsigned int readahead_count); +void dm_bht_set_read_cb(struct dm_bht *bht, dm_bht_callback read_cb); +void dm_bht_set_write_cb(struct dm_bht *bht, dm_bht_callback write_cb); +void dm_bht_set_verify_mode(struct dm_bht *bht, int verify_mode); +int dm_bht_set_root_hexdigest(struct dm_bht *bht, const u8 *hexdigest); +int dm_bht_root_hexdigest(struct dm_bht *bht, u8 *hexdigest, int available); + +/* Functions for loading in data from disk for verification */ +int dm_bht_populate(struct dm_bht *bht, void *read_cb_ctx, + unsigned int block_index); +int dm_bht_verify_block(struct dm_bht *bht, unsigned int block_index, + u8 *digest, unsigned int digest_len); + +/* Functions for creating struct dm_bhts on disk. A newly created dm_bht + * should not be directly used for verification. (It should be repopulated.) + * In addition, these functions aren't meant to be called in parallel. + */ +int dm_bht_compute(struct dm_bht *bht, void *read_cb_ctx); +int dm_bht_sync(struct dm_bht *bht, void *write_cb_ctx); +int dm_bht_store_block(struct dm_bht *bht, unsigned int block_index, + u8 *block_data); +int dm_bht_zeroread_callback(void *ctx, sector_t start, u8 *dst, sector_t count, + struct dm_bht_entry *entry); +void dm_bht_read_completed(struct dm_bht_entry *entry, int status); +void dm_bht_write_completed(struct dm_bht_entry *entry, int status); +#endif /* __LINUX_DM_BHT_H */ -- 2.20.1
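
[Editor's note] For readers working from this patch alone, here is a minimal, non-authoritative sketch of how a consumer drives the verification-side dm-bht API declared in include/linux/dm-bht.h above, mirroring what drivers/md/dm-verity.c does: a read callback supplies pages of hash data, dm_bht_populate() pages in the entries covering a block, and dm_bht_verify_block() checks a precomputed block digest up to the trusted root. The names my_read_hash_page(), setup_example() and verify_one_block() are hypothetical; only the dm_bht_* calls and their signatures come from the patch.

    #include <linux/dm-bht.h>

    /* Read callback: load 'count' sectors of hash data at 'start' from the
     * hash device into 'dst', then report completion so the entry can move
     * from PENDING to READY.  (A real implementation submits a bio here, as
     * kverityd_bht_read_callback() does in dm-verity.c.)
     */
    static int my_read_hash_page(void *ctx, sector_t start, u8 *dst,
                                 sector_t count, struct dm_bht_entry *entry)
    {
            /* ... issue the read of 'count' sectors into 'dst' ... */
            dm_bht_read_completed(entry, 0);        /* 0 on success */
            return 0;
    }

    /* One-time setup: create the tree description, attach the read callback,
     * and set the trusted root digest before any verification attempt.
     */
    static int setup_example(struct dm_bht *bht, unsigned int blocks,
                             const u8 *root_hexdigest)
    {
            int r = dm_bht_create(bht, 2 /* depth */, blocks, "sha256");

            if (r)
                    return r;
            dm_bht_set_read_cb(bht, my_read_hash_page);
            return dm_bht_set_root_hexdigest(bht, root_hexdigest);
    }

    /* Verify one data block's digest against the tree.  A positive return
     * from dm_bht_verify_block() means hash data is still pending (dm-verity
     * converts that to -EBUSY); negative means an error or mismatch.
     */
    static int verify_one_block(struct dm_bht *bht, unsigned int block,
                                u8 *digest, unsigned int digest_len)
    {
            int r = dm_bht_populate(bht, NULL /* read_cb_ctx */, block);

            if (r < 0)
                    return r;
            return dm_bht_verify_block(bht, block, digest, digest_len);
    }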
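[Editor's note] The creation-side functions declared above (dm_bht_store_block(), dm_bht_compute(), dm_bht_sync()) are not exercised by dm-verity.c itself. A hedged sketch of that path follows, again using only the declared dm_bht_* API; my_write_hash_page(), build_tree() and the hash_data_block() helper are hypothetical placeholders.

    #include <linux/dm-bht.h>

    /* Write callback: persist 'count' sectors of hash data at 'start', then
     * signal completion so dm_bht_sync() can continue.
     */
    static int my_write_hash_page(void *ctx, sector_t start, u8 *src,
                                  sector_t count, struct dm_bht_entry *entry)
    {
            /* ... write 'count' sectors from 'src' at 'start' ... */
            dm_bht_write_completed(entry, 0);       /* 0 on success */
            return 0;
    }

    /* Build a tree over 'blocks' data blocks: store each leaf digest, hash
     * upward to the root, then flush the tree through the write callback.
     * hash_data_block() is assumed to return the digest of data block i.
     */
    static int build_tree(struct dm_bht *bht, unsigned int blocks,
                          u8 *(*hash_data_block)(unsigned int))
    {
            unsigned int i;
            int r;

            r = dm_bht_create(bht, 2 /* depth */, blocks, "sha256");
            if (r)
                    return r;
            dm_bht_set_write_cb(bht, my_write_hash_page);

            for (i = 0; i < blocks; i++) {
                    r = dm_bht_store_block(bht, i, hash_data_block(i));
                    if (r)
                            goto out;
            }
            r = dm_bht_compute(bht, NULL /* read_cb_ctx */);
            if (!r)
                    r = dm_bht_sync(bht, NULL /* write_cb_ctx */);
    out:
            dm_bht_destroy(bht);
            return r;
    }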