Add documentation for autocompressor (#70)

This commit is contained in:
Azrenbeth
2021-09-28 13:38:01 +01:00
committed by GitHub
parent 9bfd786f3b
commit dcfa67ea98
9 changed files with 401 additions and 102 deletions

107
docs/algorithm.md Normal file
View File

@@ -0,0 +1,107 @@
# Compression algorithm
## What is state?
State is things like who is in a room, what the room topic/name is, who has
what privilege levels etc. Synapse keeps track of it for various reasons such as
spotting invalid events (e.g. ones sent by banned users) and providing room membership
information to clients.
## What is a state group?
Synapse needs to keep track of the state at the moment of each event. A state group
corresponds to a unique state. The database table `event_to_state_groups` keeps track
of the mapping from event ids to state group ids.
Consider the following simplified example:
```
State group id | State
_____________________________________________
1 | Alice in room
2 | Alice in room, Bob in room
3 | Bob in room
Event id | What the event was
______________________________________
1 | Alice sends a message
3 | Bob joins the room
4 | Bob sends a message
5 | Alice leaves the room
6 | Bob sends a message
Event id | State group id
_________________________
1 | 1
2 | 1
3 | 2
4 | 2
5 | 3
6 | 3
```
## What are deltas and predecessors?
When a new state event happens (e.g. Bob joins the room) a new state group is created.
BUT instead of copying all of the state from the previous state group, we just store
the change from the previous group (saving on lots of storage space!). The difference
from the previous state group is called the "delta".
So for the previous example, we would have the following (Note only rows 1 and 2 will
make sense at this point):
```
State group id | Previous state group id | Delta
____________________________________________________________
1 | NONE | Alice in room
2 | 1 | Bob in room
3 | NONE | Bob in room
```
So why is state group 3's previous state group NONE and not 2? Well, the way that deltas
work in Synapse is that they can only add in new state or overwrite old state, but they
cannot remove it. (So if the room topic is changed then that is just overwriting state,
but removing Alice from the room is neither an addition nor an overwriting). If it is
impossible to find a delta, then you just start from scratch again with a "snapshot" of
the entire state.
(NOTE this is not documentation on how synapse handles leaving rooms but is purely for illustrative
purposes)
The state of a state group is worked out by following the previous state group's and adding
together all of the deltas (with the most recent taking precedence).
The mapping from state group to previous state group takes place in `state_group_edges`
and the deltas are stored in `state_groups_state`.
## What are we compressing then?
In order to speed up the conversion from state group id to state, there is a limit of 100
hops set by synapse (that is: we will only ever have to look up the deltas for a maximum of
100 state groups). It does this by taking another "snapshot" every 100 state groups.
However, it is these snapshots that take up the bulk of the storage in a synapse database,
so we want to find a way to reduce the number of them without dramatically increasing the
maximum number of hops needed to do lookups.
## Compression Algorithm
The algorithm works by attempting to create a *tree* of deltas, produced by
appending state groups to different "levels". Each level has a maximum size, where
each state group is appended to the lowest level that is not full. This tool calls a
state group "compressed" once it has been added to
one of these levels.
This produces a graph that looks approximately like the following, in the case
of having two levels with the bottom level (L1) having a maximum size of 3:
```
L2 <-------------------- L2 <---------- ...
^--- L1 <--- L1 <--- L1 ^--- L1 <--- L1 <--- L1
NOTE: A <--- B means that state group B's predecessor is A
```
The structure that synapse creates by default would be equivalent to having one level with
a maximum length of 100.
**Note**: Increasing the sum of the sizes of levels will increase the time it
takes to query the full state of a given state group.

54
docs/python.md Normal file
View File

@@ -0,0 +1,54 @@
# Running the compressor tools from python
Both the automatic and manual tools use PyO3 to allow the compressor
to be run from Python.
To see any output from the tools, logging must be setup in Python before
the compressor is run.
## Setting things up
1. Create a virtual environment in the place you want to use the compressor from
(if it doesn't already exist)
`$ virtualenv -p python3 venv`
2. Activate the virtual environment and install `maturin` (if you haven't already)
`$ source venv/bin/activate`
`$ pip install maturin`
3. Navigate to the correct location
For the automatic tool:
`$ cd /home/synapse/rust-synapse-compress-state/auto_compressor`
For the manual tool:
`$ cd /home/synapse/rust-synapse-compress-state`
3. Build and install the library
`$ maturin develop`
This will install the relevant compressor tool into the activated virtual environment.
## Automatic tool example:
```python
import auto_compressor
auto_compressor.compress_state_events_table(
db_url="postgresql://localhost/synapse",
chunk_size=500,
default_levels="100,50,25",
number_of_chunks=100
)
```
# Manual tool example:
```python
import synapse_compress_state
synapse_compress_state.run_compression(
db_url="postgresql://localhost/synapse",
room_id="!some_room:example.com",
output_file="out.sql",
transactions=True
)
```