This commit reattempts a change to remove the write lock that was
previously tried in 998e831. This change will reduce the number of
locks on the database which should help reduce error messages that
applications see when they do not have busy_timeout set.
In addition to the lock removal, a passive checkpoint is issued
immediately before the read lock is obtained to prevent additional
checkpoints by the application itself. SQLite does not support
checkpoints from an active transaction so it cannot be done afterward.
This commit changes the restore to download multiple WAL files to
the local disk in parallel while another goroutine applies those
files in order. Downloading & applying the WAL files in serial
reduces the total throughput as WAL files are typically made up of
multiple small files.
This commit fixes a bug where restoring to a specific index will
incorrectly choose the latest snapshot instead of choosing the
latest snapshot that occurred before the given index.
This commit changes the error message of the first SQL command
executed during initialization. Typically, it wraps the error with
a message of "enable wal" since it is enabling the WAL mode but
that can be confusing if the DB connection or file is invalid.
Instead, the error is returned as-is and we can determine the
source of the error since it is the only unwrapped DB-related error.
This commit changes the `replicate` command so that it performs a
final DB sync & replica sync before it exits to ensure it has
backed up all WAL frames at the time of exit.
This commit adds the WAL header and shadow path to "wal header mismatch"
errors to help debug issues. The mismatch seems to happen more often
than I would expect on restart. This error doesn't cause any corruption;
it simply causes a generation to restart which requires a snapshot.
This commit reverts the removal of the SQLite write lock during
WAL sync (998e831c5c). The change
caused validation mismatch errors during the long-running test
although the restored database did not appear to be corrupted so
perhaps it's simply a locking issue during validation.
This commit exposes the monitor interval, checkpoint interval,
minimum checkpoint page count, and maximum checkpoint page count
via the YAML configuration file.
Originally, Litestream relied on a SQLite write lock to ensure
transactions were atomically replicated. However, this was changed
so that Litestream itself now validates the transaction boundaries.
As such, the write lock on the database is no longer needed. The
read lock is sufficient to prevent WAL rollover and the WAL is
append only so it is safe to read up to a known position calculated
via fstat().
WAL validation change was made in 031a526b9a
The locking code, however, was moved in this commit to the
post-checkpoint copy to ensure the end-of-file is not overwritten
by an aggressive writers.
This commit removes short-lived `os.Open()` calls on the database
file because this can cause locks to be released when `os.File.Close()`
is later called if the operating system does not support OFD
(Open File Descriptor) locks.
The `DB.init()` can fail temporarily for a variety of reasons such
as the database being locked. Previously, the DB would save the
`*sql.DB` connection even if a step failed and this prevented the
database from attempting initialization again. This change makes it
so that the connection is only saved if initialization is successful.
On failure, the initialization process will be retried on next sync.
This commit reverts 4e469f8 which was used for debugging the validation
stall corruption issue. It can cause the disk to fill with temporary
files though so it is being reverted.
This commit fixes a timing bug that occurs in a specific scenario
where the shadow wal sync stalls because of an s3 validation and
the catch up write to the shadow wal is large enough to allow a
window between WAL reads and the final copy.
The file copy has been replaced by direct writes of the frame
buffer to the shadow to ensure that every validated byte is exactly
what is being written to the shadow wal. The one downside to this
change is that the frame buffer will grow with the transaction
size so it will use additional heap. This can be replaced by a
spill-to-disk implementation but this should work well in the
short term.
This commit sets a hard upper limit for the WAL index to (1<<31)-1.
The index is hex-encoded in file names as a 4-byte unsigned integer
so limit ensures all index values are below any upper limit and are
unaffected by any signed int limit.
A WAL file is typically at least 4MB so you would need to write
8 petabytes to reach this upper limit.
Currently, the WAL copy function can encounter a checksum mismatch in a
WAL frame and it will return an error. This can occur for partial writes
and is recovered from moments later. This commit changes the error to a
log write instead.
Previously, there were excessive log messages for checkpoints and
retention. These have been removed or combined into a single log
message where appropriate.
This commit refactors the commands to allow a replica URL when
restoring a database. If the first CLI arg is a URL with a scheme,
the it is treated as a replica URL.
Previously, the sync would validate the last page written to ensure
that replication picked up from the last position. However, a large
WAL file followed by a series of shorter checkpointed WAL files means
that the last page could be the same even if multiple checkpoints
have occurred.
To fix this, the WAL header must match the shadow WAL header when
starting litestream since there are no guarantees about checkpoints.