118 Commits

Author SHA1 Message Date
Ben Johnson
c53a09c124 Fix read replication stream restart position 2022-06-15 14:35:17 -06:00
Ben Johnson
f53857e1ad Add minimum shadow WAL retention 2022-04-04 21:25:31 -06:00
Ben Johnson
44662022fa Allow read replication recovery from last position 2022-04-04 20:19:02 -06:00
Ben Johnson
8d10881278 Use database page size in read replication 2022-04-02 11:50:30 -06:00
Ben Johnson
00bad4308d Set permission on file replica client on init 2022-03-06 08:38:07 -07:00
Ben Johnson
a090706421 Implement live read replication
This commit adds an http server and client for streaming snapshots
and WAL pages from an upstream Litestream primary to a read-only
replica.
2022-02-19 09:06:49 -07:00
Ben Johnson
6f8cd5a9c4 Configurable monitor-delay-interval
The `monitor-delay-interval` has been added to the DB config so that
users can change the time period between WAL checks after a file
change notification has occurred. This can be useful to batch up
changes in larger files in the shadow WAL or to reduce or eliminate
the delay in propagating changes during read replication.

Setting the interval to zero or less will disable it.
2022-02-18 14:38:50 -07:00
Ben Johnson
8589111717 Implement streaming WAL segment iterator
Currently, WALSegmentIterator implementations read to the end of
their list of segments and return EOF. This commit adds
the ability to push additional segments to in-process iterators and
notify their callers that new segments are available. This is only
implemented for the file-based iterator but other segment iterators
may get this implementation in the future or have a wrapping
iterator provide a polling-based implementation.
2022-02-11 13:50:44 -07:00
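
The push-and-notify behavior described in 8589111717 might look roughly like the sketch below. It is not Litestream's implementation: the `segmentIterator` type, its method names, and the use of plain strings as segments are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// segmentIterator is a hypothetical iterator that accepts new segments
// while a caller is still consuming it.
type segmentIterator struct {
	mu       sync.Mutex
	segments []string
	notify   chan struct{} // buffered so Append never blocks
}

func newSegmentIterator(initial ...string) *segmentIterator {
	return &segmentIterator{segments: initial, notify: make(chan struct{}, 1)}
}

// Append pushes a segment and signals that more data is available.
func (it *segmentIterator) Append(seg string) {
	it.mu.Lock()
	it.segments = append(it.segments, seg)
	it.mu.Unlock()

	select {
	case it.notify <- struct{}{}:
	default: // a notification is already pending
	}
}

// Next returns the next buffered segment, or false if none is buffered yet.
func (it *segmentIterator) Next() (string, bool) {
	it.mu.Lock()
	defer it.mu.Unlock()
	if len(it.segments) == 0 {
		return "", false
	}
	seg := it.segments[0]
	it.segments = it.segments[1:]
	return seg, true
}

// NotifyCh is signaled after Append adds a segment.
func (it *segmentIterator) NotifyCh() <-chan struct{} { return it.notify }

func main() {
	it := newSegmentIterator("seg-0", "seg-1")
	for seg, ok := it.Next(); ok; seg, ok = it.Next() {
		fmt.Println(seg)
	}

	// Instead of treating the empty list as EOF, wait for a pushed segment.
	it.Append("seg-2")
	<-it.NotifyCh()
	seg, _ := it.Next()
	fmt.Println(seg)
}
```

A caller that wants the old behavior can simply stop when `Next` returns false; a streaming caller waits on `NotifyCh` and calls `Next` again.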
Ben Johnson
006e4b7155 Update index & offset encoding
Previously, the index & offsets were encoded as 8-character hex
strings; however, this limits the maximum value to a `uint32`. This
is normally not an issue, but indices could exceed the maximum
value of roughly 4 billion over time, and the offset could exceed
this value for an especially large WAL update. For safety, these
encodings have been updated to 16-character hex encodings.
2022-02-08 13:14:49 -07:00
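
The width problem described in 006e4b7155 can be illustrated with a small sketch. The function names here are hypothetical and are not taken from the Litestream source; only the 8-character and 16-character widths come from the commit message.

```go
package main

import (
	"fmt"
	"math"
)

// formatHex8 mimics the old fixed-width encoding (hypothetical helper).
func formatHex8(v uint64) string { return fmt.Sprintf("%08x", v) }

// formatHex16 mimics the new fixed-width encoding (hypothetical helper).
func formatHex16(v uint64) string { return fmt.Sprintf("%016x", v) }

func main() {
	over := uint64(math.MaxUint32) + 1

	fmt.Println(formatHex8(math.MaxUint32)) // ffffffff
	fmt.Println(formatHex8(over))           // 100000000 (9 characters)
	fmt.Println(formatHex16(over))          // 0000000100000000
}
```

Once a value passes the `uint32` range, the 8-character form grows to nine characters and no longer sorts lexically after `ffffffff`, while the 16-character form keeps a fixed width for the full `uint64` range.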
Ben Johnson
30a8d07a81 Add WAL overrun validation
Under high write load, it is possible for write transactions from
another process to overrun the WAL between the time when Litestream
performs a RESTART checkpoint and when it obtains the write lock
immediately after. This change adds validation that an overrun has
not occurred and, if it has, it will start a new generation.
2022-02-07 13:35:20 -07:00
Ben Johnson
76e53dc6ea Remove built-in validation option
Previously, Litestream had a validator that worked most of the time
but also caused some false positives. It is difficult to provide
validation from within Litestream without controlling outside processes
that can also affect the database. As such, validation has been moved
out to the external CI test runner which provides a more consistent
validation process.
2022-02-06 11:37:06 -07:00
Ben Johnson
762c7ae531 Implement FileWatcher 2022-02-06 09:51:04 -07:00
Ben Johnson
4349398ff5 Remove shadow WAL iterator
This commit removes the shadow WAL iterator and replaces it with a
fileWalSegmentIterator instead. This works since the shadow WAL now
has the same structure as the replica WAL. This reduces duplicate
code and will make it so read replication can be daisy chained in
the future.
2022-01-31 16:09:02 -07:00
Ben Johnson
5d811f2e39 Fix golangci-lint issues 2022-01-31 09:21:20 -07:00
Ben Johnson
f6c859061b Fix CodeQL warnings 2022-01-31 08:53:21 -07:00
Ben Johnson
dbdde21341 Use sqlite3_file_control(SQLITE_FCNTL_PERSIST_WAL) to persist WAL
Previously, Litestream would avoid closing the SQLite3 connection
in order to ensure that the WAL file was not cleaned up by the
database if it was the last connection. This commit changes the
behavior by introducing a file control call to perform the same
action. This allows us to close the database file normally in all
cases.
2022-01-28 15:12:43 -07:00
Ben Johnson
84d08f547a Add end-to-end replication/restore testing 2022-01-15 09:05:46 -07:00
Ben Johnson
3f0ec9fa9f Refactor Restore()
This commit refactors out the complexity of downloading ordered WAL
files in parallel to a type called `WALDownloader`. This makes it
easier to test the restore separately from the download.
2022-01-04 15:03:59 -07:00
Ben Johnson
531e19ed6f Refactor checksum calculation; improve test coverage 2021-12-12 10:25:20 -07:00
Ben Johnson
77274abf81 Refactor shadow WAL to use segments 2021-07-23 07:46:21 -06:00
Ben Johnson
fc897b481f Group replica wal segments by index
This commit changes the replica path format to group segments within
a single index in the same directory. This is to eventually add the
ability to seek to a record on file-based systems without having
to iterate over the records. The DB shadow WAL will also be changed
to this same format to support live replicas.
2021-06-14 15:24:05 -06:00
Ben Johnson
55c17b9d8e Move WAL checksum validation message to trace logging
Checksum mismatch can regularly occur now that write locks have
been removed during WAL sync. This does not pose any corruption
risk but does sound scary to end users. Moving this to trace
logging instead.
2021-06-06 09:12:29 -06:00
Ben Johnson
fb80bc10ae Refactor replica system 2021-05-21 07:44:36 -06:00
Ben Johnson
331f6072bf Fix snapshot-only restore
This commit fixes a bug introduced by parallel restore (03831e2)
where snapshot-only restores were not being handled correctly and
Litestream would hang indefinitely. Now the restore will check
explicitly for snapshot-only restores and exit the restore process
early to avoid WAL handling completely.
2021-04-24 07:48:25 -06:00
Ben Johnson
1d1fd6e686 Remove SQLite write lock during WAL sync (again)
This commit reattempts a change to remove the write lock that was
previously tried in 998e831. This change will reduce the number of
locks on the database which should help reduce error messages that
applications see when they do not have busy_timeout set.

In addition to the lock removal, a passive checkpoint is issued
immediately before the read lock is obtained to prevent additional
checkpoints by the application itself. SQLite does not support
checkpoints from an active transaction so it cannot be done afterward.
2021-04-22 16:35:04 -06:00
Ben Johnson
03831e2d06 Download WAL files in parallel during restore
This commit changes the restore to download multiple WAL files to
the local disk in parallel while another goroutine applies those
files in order. Downloading & applying the WAL files in serial
reduces the total throughput as WAL files are typically made up of
multiple small files.
2021-04-21 16:07:29 -06:00
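
One way to picture the approach in 03831e2d06 is a pool of download workers feeding a single in-order applier. This is only a sketch under assumptions: `download`, `restoreWAL`, and the per-index result channels are invented for illustration, errors are ignored, and nothing here limits how far downloads run ahead of the applier.

```go
package main

import (
	"fmt"
	"sync"
)

// download is a stand-in for fetching one WAL file (hypothetical).
func download(index int) string { return fmt.Sprintf("wal-%08x", index) }

// restoreWAL downloads n WAL files using the given number of workers
// and applies them strictly in index order.
func restoreWAL(n, parallelism int, apply func(index int, data string)) {
	// One buffered channel per index lets the applier wait for each
	// file in order, regardless of which download finishes first.
	results := make([]chan string, n)
	for i := range results {
		results[i] = make(chan string, 1)
	}

	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < parallelism; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] <- download(i)
			}
		}()
	}

	// Feed indexes to the workers in the background.
	go func() {
		for i := 0; i < n; i++ {
			jobs <- i
		}
		close(jobs)
	}()

	// Apply in order on the calling goroutine.
	for i := 0; i < n; i++ {
		apply(i, <-results[i])
	}
	wg.Wait()
}

func main() {
	var applied []int
	restoreWAL(8, 3, func(i int, data string) { applied = append(applied, i) })
	fmt.Println(applied) // [0 1 2 3 4 5 6 7]
}
```

The applier never sees files out of order, but it only blocks when the next file in sequence has not finished downloading.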
Ben Johnson
1c01af4e69 Fix snapshot selection during restore-by-index
This commit fixes a bug where restoring to a specific index will
incorrectly choose the latest snapshot instead of choosing the
latest snapshot that occurred before the given index.
2021-04-21 12:09:05 -06:00
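
The selection rule fixed in 1c01af4e69 can be sketched as follows. The function name and the use of plain integers for snapshot indexes are assumptions, as is treating a snapshot taken exactly at the target index as eligible.

```go
package main

import "fmt"

// latestSnapshotAtOrBefore returns the highest snapshot index that does
// not exceed target. The second result is false if no snapshot qualifies.
func latestSnapshotAtOrBefore(snapshots []int, target int) (int, bool) {
	best, found := 0, false
	for _, idx := range snapshots {
		if idx <= target && (!found || idx > best) {
			best, found = idx, true
		}
	}
	return best, found
}

func main() {
	snapshots := []int{0, 100, 250}

	// Picking the overall latest snapshot (250) would overshoot index 180.
	idx, ok := latestSnapshotAtOrBefore(snapshots, 180)
	fmt.Println(idx, ok) // 100 true
}
```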
Ben Johnson
84830bc4ad Improve restoration logging
This commit splits out logging for downloading a WAL file and applying
the WAL file to the database to get more accurate timing measurements.
2021-04-18 09:33:53 -06:00
Ben Johnson
3ad157d841 Remove -dry-run flag in restore
This flag is being removed because it's not actually that useful
in practice and it just makes the restoration code more complicated.
2021-04-18 09:21:50 -06:00
Ben Johnson
247896b8b7 Remove reference to "wal" in first db init command
This commit changes the error message of the first SQL command
executed during initialization. Typically, it wraps the error with
a message of "enable wal" since it is enabling the WAL mode but
that can be confusing if the DB connection or file is invalid.

Instead, the error is returned as-is and we can determine the
source of the error since it is the only unwrapped DB-related error.
2021-04-15 11:51:22 -06:00
Ben Johnson
462330ead6 Support ARM release builds 2021-04-10 08:39:10 -06:00
Ben Johnson
0529ce74b7 Sync on close
This commit changes the `replicate` command so that it performs a
final DB sync & replica sync before it exits to ensure it has
backed up all WAL frames at the time of exit.
2021-03-21 08:43:55 -06:00
Ben Johnson
aa54e4698d Merge pull request #109 from benbjohnson/wal-mismatch-validation-info
Add WAL validation debug information
2021-03-07 07:55:02 -07:00
Ben Johnson
0bd1b13b94 Add wal validation debug information on error
This commit adds the WAL header and shadow path to "wal header mismatch"
errors to help debug issues. The mismatch seems to happen more often
than I would expect on restart. This error doesn't cause any corruption;
it simply causes a generation to restart which requires a snapshot.
2021-03-07 07:48:43 -07:00
Ben Johnson
1c16aae550 Revert sync lock removal
This commit reverts the removal of the SQLite write lock during
WAL sync (998e831c5c). The change
caused validation mismatch errors during the long-running test
although the restored database did not appear to be corrupted so
perhaps it's simply a locking issue during validation.
2021-03-07 07:30:25 -07:00
Ben Johnson
8947adc312 Expose additional DB configuration settings
This commit exposes the monitor interval, checkpoint interval,
minimum checkpoint page count, and maximum checkpoint page count
via the YAML configuration file.
2021-03-06 08:33:19 -07:00
Ben Johnson
998e831c5c Remove SQLite write lock during WAL sync
Originally, Litestream relied on a SQLite write lock to ensure
transactions were atomically replicated. However, this was changed
so that Litestream itself now validates the transaction boundaries.
As such, the write lock on the database is no longer needed. The
read lock is sufficient to prevent WAL rollover and the WAL is
append only so it is safe to read up to a known position calculated
via fstat().

The WAL validation change was made in 031a526b9a.

The locking code, however, was moved in this commit to the
post-checkpoint copy to ensure the end-of-file is not overwritten
by aggressive writers.
2021-03-06 07:51:35 -07:00
Ben Johnson
a14a74d678 Fix release of non-OFD locks
This commit removes short-lived `os.Open()` calls on the database
file because this can cause locks to be released when `os.File.Close()`
is later called if the operating system does not support OFD
(Open File Descriptor) locks.
2021-02-28 06:44:02 -07:00
Ben Johnson
d802e15b4f Fix error handling when DB.init() fails
The `DB.init()` can fail temporarily for a variety of reasons such
as the database being locked. Previously, the DB would save the
`*sql.DB` connection even if a step failed and this prevented the
database from attempting initialization again. This change makes it
so that the connection is only saved if initialization is successful.
On failure, the initialization process will be retried on next sync.
2021-02-24 15:43:28 -07:00
Ben Johnson
37442babfb Revert validation mismatch temp file persistence
This commit reverts 4e469f8 which was used for debugging the validation
stall corruption issue. It can cause the disk to fill with temporary
files though so it is being reverted.
2021-02-09 06:44:42 -07:00
Ben Johnson
7f81890bae Fix shadow wal corruption on stalled validation
This commit fixes a timing bug that occurs in a specific scenario
where the shadow wal sync stalls because of an S3 validation and
the catch up write to the shadow wal is large enough to allow a
window between WAL reads and the final copy.

The file copy has been replaced by direct writes of the frame
buffer to the shadow to ensure that every validated byte is exactly
what is being written to the shadow wal. The one downside to this
change is that the frame buffer will grow with the transaction
size so it will use additional heap. This can be replaced by a
spill-to-disk implementation but this should work well in the
short term.
2021-02-06 07:28:15 -07:00
Ben Johnson
6fd11ccab5 Enforce max WAL index.
This commit sets a hard upper limit for the WAL index to (1<<31)-1.
The index is hex-encoded in file names as a 4-byte unsigned integer
so this limit ensures all index values are below any upper limit and
are unaffected by any signed int limit.

A WAL file is typically at least 4MB so you would need to write
8 petabytes to reach this upper limit.
2021-02-02 15:11:50 -07:00
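
The arithmetic behind the 8 petabyte figure in 6fd11ccab5 can be checked with a short sketch. The constant names are made up for illustration, and 4MB is read here as 4 MiB, which gives the result in binary petabytes.

```go
package main

import "fmt"

const (
	maxIndex   = (1 << 31) - 1 // upper limit stated in the commit
	minWALSize = 4 << 20       // assumed 4 MiB per WAL file
)

// totalBytes is the data written if every index from 0 through maxIndex
// holds one WAL file of minWALSize bytes.
func totalBytes() uint64 { return (uint64(maxIndex) + 1) * minWALSize }

func main() {
	fmt.Printf("%08x\n", maxIndex)  // 7fffffff
	fmt.Println(totalBytes() >> 50) // 8 (PiB)
}
```

The limit fits in eight hex characters and stays within the positive range of a signed 32-bit integer.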
Ben Johnson
6c49fba592 Check checkpoint result during restore 2021-02-02 15:04:20 -07:00
Ben Johnson
f17768e830 Log WAL frame checksum mismatch
Currently, the WAL copy function can encounter a checksum mismatch in a
WAL frame and it will return an error. This can occur for partial writes
and is recovered from moments later. This commit changes the error to a
log write instead.
2021-01-31 08:52:12 -07:00
Ben Johnson
4e469f8b02 Persist primary/replica copies after validation mismatch
This commit changes `ValidateReplica()` to persist copies of the
primary & replica databases for inspection if a validation mismatch
occurs.
2021-01-31 08:47:06 -07:00
Ben Johnson
ad7bf7f974 Reduce logging output
Previously, there were excessive log messages for checkpoints and
retention. These have been removed or combined into a single log
message where appropriate.
2021-01-31 08:12:18 -07:00
Ben Johnson
39a6fabb9f Fix restore logging. 2021-01-26 17:01:00 -07:00
Ben Johnson
67eeb49101 Allow replica URL to be used for commands
This commit refactors the commands to allow a replica URL when
restoring a database. If the first CLI arg is a URL with a scheme,
then it is treated as a replica URL.
2021-01-26 16:33:16 -07:00
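
The "URL with a scheme" check described in 67eeb49101 could be approximated with Go's standard `net/url` package. The helper name `isURL` and the example arguments are assumptions; the actual check in Litestream may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// isURL reports whether arg parses as a URL with a non-empty scheme.
// Plain file paths such as /var/lib/db.sqlite have no scheme.
func isURL(arg string) bool {
	u, err := url.Parse(arg)
	return err == nil && u.Scheme != ""
}

func main() {
	// "mybucket" and the paths below are placeholder values.
	fmt.Println(isURL("s3://mybucket/db"))   // true
	fmt.Println(isURL("/var/lib/db.sqlite")) // false
}
```

An argument that passes this check would be treated as a replica URL, and anything else as a database path.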
Ben Johnson
94411923a7 Fix unit test 2021-01-21 13:52:35 -07:00
Ben Johnson
e92db9ef4b Enforce stricter validation on restart.
Previously, the sync would validate the last page written to ensure
that replication picked up from the last position. However, a large
WAL file followed by a series of shorter checkpointed WAL files means
that the last page could be the same even if multiple checkpoints
have occurred.

To fix this, the WAL header must match the shadow WAL header when
starting litestream since there are no guarantees about checkpoints.
2021-01-21 13:44:05 -07:00
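
The header comparison described in e92db9ef4b can be sketched as a byte-for-byte check of the 32-byte SQLite WAL header. SQLite changes the salt values stored in the header whenever the WAL is restarted, so a differing header shows that the WAL was reset since the shadow copy was written. The function name and the in-memory byte slices are assumptions; the real code reads these bytes from files.

```go
package main

import (
	"bytes"
	"fmt"
)

// walHeaderSize is the size of the SQLite WAL file header in bytes.
const walHeaderSize = 32

// headersMatch reports whether both inputs begin with identical WAL headers.
func headersMatch(wal, shadow []byte) bool {
	if len(wal) < walHeaderSize || len(shadow) < walHeaderSize {
		return false
	}
	return bytes.Equal(wal[:walHeaderSize], shadow[:walHeaderSize])
}

func main() {
	wal := make([]byte, walHeaderSize)
	shadow := make([]byte, walHeaderSize)
	fmt.Println(headersMatch(wal, shadow)) // true

	// Bytes 16 through 23 hold the two salt values.
	wal[16] = 0x01
	fmt.Println(headersMatch(wal, shadow)) // false
}
```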