Folia

mirror of https://github.com/PaperMC/Folia.git synced 2025-02-03 23:31:20 +01:00

History

Spottedleaf 31b5b1575b Use coordinate-based locking to increase chunk system parallelism	2023-05-14 19:46:24 -07:00
api	Undo making JavaPlugin#logger field public (see PaperMC/Paper#9125) (#76 )	2023-05-14 18:10:49 -07:00
server	Use coordinate-based locking to increase chunk system parallelism	2023-05-14 19:46:24 -07:00

Spottedleaf 31b5b1575b Use coordinate-based locking to increase chunk system parallelism

A significant overhead in Folia comes from the chunk system's
locks: the ticket lock and the scheduling lock. The public
test server, which had ~330 players, had significant performance
problems with these locks: ~80% of the time spent ticking
was _waiting_ for the locks to free. Given that it used
around 15 cores total at peak, this is a complete and utter loss
of potential.

To address this issue, I have replaced the ticket lock and scheduling
lock with two ReentrantAreaLocks. The ReentrantAreaLock takes a
shift, which is used internally to group positions into sections.
This grouping is necessary, as the possible radius of area that
needs to be acquired for any given lock usage is up to 64. As such,
the shift is critical to reduce the number of areas required to lock
for any lock operation. Currently, it is set to a shift of 6, which
is identical to the ticket level propagation shift (it must be at
least the ticket level propagation shift AND the region shift).
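
To make the effect of the shift concrete, the sketch below counts how
many sections a square area of chunk positions touches for a given
shift. The class and method names are hypothetical and are not taken
from the ReentrantAreaLock implementation; only the radius of 64 and
the shift of 6 come from the description above.

```java
// Hypothetical helper, not Folia's ReentrantAreaLock: it only illustrates
// how a shift groups chunk coordinates into sections.
final class AreaSectionMath {

    private AreaSectionMath() {}

    // Section coordinate of a chunk coordinate. The arithmetic right shift
    // floors, so negative coordinates land in the correct section.
    static int sectionCoordinate(final int chunkCoordinate, final int shift) {
        return chunkCoordinate >> shift;
    }

    // Number of sections touched by the square
    // [center - radius, center + radius] on both axes.
    static long sectionsCovered(final int centerX, final int centerZ,
                                final int radius, final int shift) {
        final long spanX = (long)sectionCoordinate(centerX + radius, shift)
            - sectionCoordinate(centerX - radius, shift) + 1L;
        final long spanZ = (long)sectionCoordinate(centerZ + radius, shift)
            - sectionCoordinate(centerZ - radius, shift) + 1L;
        return spanX * spanZ;
    }
}
```

For a radius of 64 around chunk (0, 0), a shift of 6 covers 9 sections,
while a shift of 0 would cover 16641 (129 x 129) individual positions.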

The chunk system locking changes required a complete rewrite of the
chunk system tick, chunk system unload, and chunk system ticket level
propagation - as all of the previous logic only works with a single
global lock.

This does introduce two other section shifts: the lock shift, and the
ticket shift. The lock shift is simply what shift the area locks use,
and the ticket shift represents the size of the ticket sections.
Currently, these values are just set to the region shift for simplicity.
However, they are not arbitrary: the lock shift must be at least the size
of the ticket shift and must be at least the size of the region shift.
The ticket shift must also be >= ceil(log2(max ticket level source)).
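
The constraints above can be written down as a small check. This is a
sketch only: the class, the method names, and any values passed in are
hypothetical, and nothing here is read from the actual configuration.

```java
// Hypothetical validation of the shift constraints described above;
// names are illustrative and not taken from Folia's source.
final class ShiftConstraints {

    private ShiftConstraints() {}

    // ceil(log2(value)) for value >= 1, using integer arithmetic only.
    static int ceilLog2(final int value) {
        if (value < 1) {
            throw new IllegalArgumentException("value must be >= 1");
        }
        return 32 - Integer.numberOfLeadingZeros(value - 1);
    }

    // Throws if any of the three constraints is violated.
    static void validate(final int lockShift, final int ticketShift,
                         final int regionShift, final int maxTicketLevelSource) {
        if (lockShift < ticketShift) {
            throw new IllegalArgumentException("lock shift must be >= ticket shift");
        }
        if (lockShift < regionShift) {
            throw new IllegalArgumentException("lock shift must be >= region shift");
        }
        if (ticketShift < ceilLog2(maxTicketLevelSource)) {
            throw new IllegalArgumentException(
                "ticket shift must be >= ceil(log2(max ticket level source))");
        }
    }
}
```

For example, if the largest ticket level source were 64 (an assumed
value), the ticket shift would have to be at least 6.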

The chunk system's ticket propagator is now global state, instead of
region state. This cleans up the logic for ticket levels significantly,
and removes usage of the region lock in this area, but it also means
that the addition of a ticket no longer creates a region. To alleviate
the side effects of this change, the global tick thread now processes
ticket level updates for each world every tick to guarantee eventual
ticket level processing. The chunk system also provides a hook to
process ticket level changes in a given _section_, so that the
region queue can guarantee that, after adding its reference counter,
the region section is created, exists, and won't be destroyed.

The ticket propagator operates by updating the sources in a single ticket
section, and propagating the updates to its 1 radius neighbours. This
allows the ticket updates to occur in parallel or selectively (see above).
Currently, the process ticket level update function operates by
polling from a concurrent queue of sections to update and simply
invoking the single section update logic. This allows the function
to operate completely in parallel, provided the queue is ordered correctly.
Additionally, this limits the area used in the ticket/scheduling lock
when processing updates, which should massively increase parallelism compared
to before.
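
The polling approach described above can be sketched roughly as
follows. This is not the ticket propagator itself: the class, the key
packing, and the callback are hypothetical, and the area locking around
each section update is left out.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.LongConsumer;

// Hypothetical sketch of queue-driven section updates; not Folia's code.
final class SectionUpdateQueue {

    private final ConcurrentLinkedQueue<Long> pending = new ConcurrentLinkedQueue<>();
    // Tracks which sections are currently queued so duplicates are skipped.
    private final Set<Long> queued = ConcurrentHashMap.newKeySet();

    // Packs two section coordinates into a single long key.
    static long key(final int sectionX, final int sectionZ) {
        return ((long)sectionZ << 32) | (sectionX & 0xFFFFFFFFL);
    }

    // Marks a section as needing a ticket level update.
    void markDirty(final int sectionX, final int sectionZ) {
        final Long key = key(sectionX, sectionZ);
        if (this.queued.add(key)) {
            this.pending.add(key);
        }
    }

    // May be called from several threads at once; each queue entry is
    // polled by exactly one caller. A real implementation would run the
    // update while holding the area lock for the section and its
    // 1 radius neighbours, which this sketch omits.
    int processUpdates(final LongConsumer singleSectionUpdate) {
        int processed = 0;
        Long key;
        while ((key = this.pending.poll()) != null) {
            this.queued.remove(key);
            singleSectionUpdate.accept(key);
            ++processed;
        }
        return processed;
    }
}
```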

The chunk system ticket addition for expirable ticket types has been modified
to no longer track exact tick deadlines, as this relies on what region the
ticket is in. Instead, the chunk system tracks a map of
lock section -> (chunk coordinate -> expire ticket count) and every ticket
has been changed to have a removeDelay count that is decremented each tick.
Each region searches its own sections to find tickets to try to expire.
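
A minimal model of this bookkeeping is sketched below. The class, the
Ticket type, and the key packing are hypothetical, and the sketch is
single-threaded; only the map shape, the removeDelay counter, and the
shift of 6 come from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical model of removeDelay-based expiry; not Folia's ticket classes.
final class ExpiringTicketIndex {

    private static final int LOCK_SHIFT = 6;

    static final class Ticket {
        long removeDelay; // ticks left before the ticket expires
        Ticket(final long removeDelay) { this.removeDelay = removeDelay; }
    }

    // lock section -> (chunk coordinate -> expirable ticket count)
    private final Map<Long, Map<Long, Integer>> expireCounts = new HashMap<>();
    // chunk coordinate -> expirable tickets held on that chunk
    private final Map<Long, List<Ticket>> tickets = new HashMap<>();

    static long key(final int x, final int z) {
        return ((long)z << 32) | (x & 0xFFFFFFFFL);
    }

    void addTicket(final int chunkX, final int chunkZ, final long removeDelay) {
        final long chunkKey = key(chunkX, chunkZ);
        final long sectionKey = key(chunkX >> LOCK_SHIFT, chunkZ >> LOCK_SHIFT);
        this.tickets.computeIfAbsent(chunkKey, k -> new ArrayList<>()).add(new Ticket(removeDelay));
        this.expireCounts.computeIfAbsent(sectionKey, k -> new HashMap<>()).merge(chunkKey, 1, Integer::sum);
    }

    int expirableCount(final int chunkX, final int chunkZ) {
        final Map<Long, Integer> counts =
            this.expireCounts.get(key(chunkX >> LOCK_SHIFT, chunkZ >> LOCK_SHIFT));
        return counts == null ? 0 : counts.getOrDefault(key(chunkX, chunkZ), 0);
    }

    // Ticks one lock section: decrements every removeDelay in it and drops
    // tickets that reach zero. Returns the number of expired tickets.
    int tickSection(final int sectionX, final int sectionZ) {
        final long sectionKey = key(sectionX, sectionZ);
        final Map<Long, Integer> counts = this.expireCounts.get(sectionKey);
        if (counts == null) {
            return 0;
        }
        int expired = 0;
        for (final Iterator<Map.Entry<Long, Integer>> it = counts.entrySet().iterator(); it.hasNext();) {
            final Map.Entry<Long, Integer> entry = it.next();
            final List<Ticket> chunkTickets = this.tickets.get(entry.getKey());
            final int before = chunkTickets.size();
            chunkTickets.removeIf(ticket -> --ticket.removeDelay <= 0);
            expired += before - chunkTickets.size();
            if (chunkTickets.isEmpty()) {
                this.tickets.remove(entry.getKey());
                it.remove();
            } else {
                entry.setValue(chunkTickets.size());
            }
        }
        if (counts.isEmpty()) {
            this.expireCounts.remove(sectionKey);
        }
        return expired;
    }
}
```

Because the counts are keyed by lock section, a region only has to look
at the sections it owns and never needs to know an absolute deadline.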

Chunk system unloading has been modified to track unloads by lock section.
The ordering is determined by which section a chunk resides in.
The unload process now removes from unload sections and processes
the full unload stages (1, 2, 3) before moving to the next section, if possible.
This allows the unload logic to only hold one lock section at a time for
each lock, which is a massive parallelism increase.
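
The per-section unload flow might look roughly like the sketch below.
Everything in it is an assumption made for illustration: the class, the
sorted map used for ordering, the opaque stage callbacks, and running
each stage across all chunks of a section before the next stage. The
locking itself is omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.LongConsumer;

// Hypothetical sketch of unloads grouped by lock section; not Folia's code.
final class SectionedUnloadQueue {

    private static final int LOCK_SHIFT = 6;

    // lock section -> chunks queued for unload in that section
    private final TreeMap<Long, ArrayDeque<Long>> unloadSections = new TreeMap<>();

    static long key(final int x, final int z) {
        return ((long)z << 32) | (x & 0xFFFFFFFFL);
    }

    void queueUnload(final int chunkX, final int chunkZ) {
        final long sectionKey = key(chunkX >> LOCK_SHIFT, chunkZ >> LOCK_SHIFT);
        this.unloadSections.computeIfAbsent(sectionKey, k -> new ArrayDeque<>())
            .add(key(chunkX, chunkZ));
    }

    // Removes one section from the queue and runs every stage for all of
    // its chunks before returning. Returns false when nothing is queued.
    boolean processNextSection(final List<LongConsumer> stages) {
        final Map.Entry<Long, ArrayDeque<Long>> entry = this.unloadSections.pollFirstEntry();
        if (entry == null) {
            return false;
        }
        // A real implementation would hold only this section's part of
        // each area lock while the stages run.
        final List<Long> chunks = new ArrayList<>(entry.getValue());
        for (final LongConsumer stage : stages) {
            for (final long chunkKey : chunks) {
                stage.accept(chunkKey);
            }
        }
        return true;
    }
}
```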

In stress testing, these changes lowered the locking overhead to only 5%
from ~70%, which completely fixes the original problem as described.

2023-05-14 19:46:24 -07:00
