KAIST and Google AI Present Blockwise Parallel Decoding for Optimized Model Performance
December 27, 2024
Blog
If you have dealt with large language models — specifically autoregressive decoder algorithms — you know the inconvenience of sluggish processing. Novel findings from KAIST and Google researchers on blockwise parallel decoding may offer a solution.
Blockwise Parallel Decoding Adoption Marks a Pivotal Shift
A student research group from Google Research and KAIST AI published a June 2024 paper titled Exploring and Improving Drafts in Blockwise Parallel Decoding. Their work focuses on autoregressive decoders — language models that use previously generated output to predict future elements in a sequence.
While autoregressive decoders are highly effective at generating coherent sequences when data point context is critical, latency is an issue.
Despite significant technological developments, slow inference speeds remain inherent to sequential token generation. Models with hundreds of billions of parameters experience substantial latency.
This is because generating one token at a time makes heavy demands on memory bandwidth and compute at every step, and the delays accumulate with each additional token. That lag hinders the model’s ability to perform effectively in real time.
In blockwise parallel decoding, a natural language processing (NLP) model is given multiple prediction heads that propose a set of subsequent tokens, called a block draft, in parallel. The autoregressive base model then verifies the draft and conditionally accepts the longest prefix that matches its own predictions, accelerating inference.
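To make the propose-and-verify loop concrete, here is a minimal Python sketch. The `draft_heads` and `base_model` callables are hypothetical stand-ins that each return a greedy next-token prediction; this illustrates the general scheme, not the paper’s implementation.

```python
def propose_block(draft_heads, context, block_size):
    # Each prediction head proposes one future token. In a real model these
    # proposals come from a single parallel forward pass, not a Python loop.
    return [head(context) for head in draft_heads[:block_size]]

def verify_block(base_model, context, draft):
    # Accept the longest prefix of the draft that matches the base model's own
    # greedy predictions, then append the base model's token at the first
    # mismatch so every step still makes progress.
    accepted = []
    for token in draft:
        expected = base_model(context + accepted)
        if token != expected:
            accepted.append(expected)
            break
        accepted.append(token)
    return accepted

def generate(base_model, draft_heads, prompt, block_size=4, max_len=32):
    out = list(prompt)
    while len(out) < max_len:
        draft = propose_block(draft_heads, out, block_size)
        out.extend(verify_block(base_model, out, draft))
    return out
```

When the heads guess well, several tokens are accepted per step; when they guess poorly, the loop degrades gracefully to one base-model token per step.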
What Blockwise Parallel Decoding Is Meant to Accomplish
While the research group’s finding that blockwise parallel decoding improves inference speeds is notable, it is not novel; the concept was first proposed in 2018, and those researchers reduced decoding iterations by a factor of seven with only a slight performance loss.
It has already been effectively proven that simultaneously predicting block drafts — which the autoregressive model subsequently verifies and conditionally accepts — improves generation speed significantly.
This is where the researchers from KAIST AI and Google Research deviate. Their goal was not simply to show that accelerated text generation is possible but to improve the quality of the block drafts themselves.
Based on their observations, the researchers developed two refinement algorithms for block drafts to significantly increase efficiency across various tasks — and without changing the algorithm’s underlying parameters. The first leverages a small, autoregressive language model to rescore local predictions.
Neural rescoring favors fluent, context-relevant sequences, facilitating cohesion and logic across the prediction heads.
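As a rough illustration of the idea, and not the authors’ exact algorithm, the sketch below scores every combination of per-head candidate tokens with an assumed `small_lm_logprob(prefix, token)` helper and keeps the most fluent drafts. A real system would prune this search rather than enumerate it exhaustively.

```python
from itertools import product

def rescore_with_small_lm(head_candidates, context, small_lm_logprob, keep=1):
    # head_candidates: one list of top-k candidate tokens per prediction head,
    # e.g. [["the", "a"], ["cat", "dog"], ...].
    scored = []
    for combo in product(*head_candidates):
        prefix, score = list(context), 0.0
        for token in combo:
            score += small_lm_logprob(prefix, token)  # log P(token | prefix)
            prefix.append(token)
        scored.append((score, list(combo)))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [draft for _, draft in scored[:keep]]
```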
The second employs n-gram language models, using dynamic programming to rescore every path through the candidate tokens. It then outputs the most probable paths as a collection of draft candidates, which the blockwise language model can verify in parallel. The benefit of this approach is its efficiency: it can search an enormous set of possible candidates cheaply, so it scales well to large language models.
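A hedged sketch of that idea, using a bigram model and a Viterbi-style dynamic program, is shown below. The `bigram_logprob(prev, tok)` lookup is an assumption, and the sketch returns only the single best path, whereas the paper keeps several top candidates for parallel verification.

```python
def viterbi_rescore(head_candidates, last_context_token, bigram_logprob):
    # best maps "last token of a partial path" to (score, path). Because a
    # bigram score depends only on the previous token, keeping one best path
    # per token is enough.
    best = {last_context_token: (0.0, [])}
    for candidates in head_candidates:
        nxt = {}
        for prev, (prev_score, path) in best.items():
            for tok in candidates:
                score = prev_score + bigram_logprob(prev, tok)
                if tok not in nxt or score > nxt[tok][0]:
                    nxt[tok] = (score, path + [tok])
        best = nxt
    # Return the highest-scoring path through the lattice of candidates.
    return max(best.values(), key=lambda pair: pair[0])[1]
```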
The Cons of Nonautoregressive Decoding Strategies
Despite making strides in blockwise parallel decoding, the research paper’s authors admit that several flaws remain to be addressed.
Limited Contextual Understanding
Parallel prediction heads may struggle with complex contexts where the relationship between tokens is crucial, leading to errors in understanding nuanced or elaborate language. This may noticeably impact output personalization or relevance, which poses an issue.
Many businesses have adopted AI for personalization. According to one survey, 81% of people prefer organizations that personalize the customer experience. Companies that depend on that personalization won’t be inclined to adopt this method or the new algorithms just to improve output generation speed if relevance suffers.
Varying Prediction Confidence Levels
Autoregressive language models keep each new token consistent with everything generated before it because tokens are produced sequentially. Since blockwise parallel decoding predicts several tokens at once, its drafts are less likely to form coherent sequences.
Concurrently generated tokens — specifically vanilla block drafts — tend to be unorganized, unnatural or inaccurate due to varying confidence levels in predictions. As a result, the autoregressive model may not accept them.
Tendency to Repeat Block Drafts
Vanilla block drafts tend to contain consecutive repetitions because each head predicts its token separately from the others. Since neighboring heads propose tokens in isolation, without seeing one another’s choices, they often land on the same output.
The KAIST AI and Google Research team discovered that 20%-75% of neighboring draft tokens were repeats. The exact percentage varied depending on the task, but the trend held to some degree across every task they examined.
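A quick way to picture that statistic: the hypothetical helper below counts how often neighboring positions in a draft simply repeat the previous token.

```python
def consecutive_repeat_rate(draft_tokens):
    # Fraction of neighboring token pairs in a block draft that are repeats.
    if len(draft_tokens) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(draft_tokens, draft_tokens[1:]) if a == b)
    return repeats / (len(draft_tokens) - 1)

# Example draft: two of the four neighboring pairs are repeats.
print(consecutive_repeat_rate(["the", "the", "cat", "sat", "sat"]))  # 0.5
```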
The Pros of Nonautoregressive Decoding Strategies
While there are still issues to address, this non-autoregressive decoding strategy holds promise.
Accelerates Text Generation Speed
Blockwise parallel decoding cuts the number of sequential decoding steps, making better use of memory bandwidth and compute, so far less time is wasted waiting on token-by-token generation. The researchers successfully accelerated text generation across tasks.
The research group found that refined block drafts improve block efficiency — the number of accepted tokens from the draft — by 5%-21% across various datasets. This method also optimizes resource usage.
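For context on the metric, here is a minimal sketch that treats block efficiency as the average number of draft tokens accepted per decoding step; this phrasing is an assumption for illustration, not necessarily the paper’s exact definition.

```python
def block_efficiency(accepted_per_step):
    # Average number of draft tokens the base model accepted at each step.
    return sum(accepted_per_step) / len(accepted_per_step)

# Example: three decoding steps that accepted 3, 5 and 4 draft tokens.
print(block_efficiency([3, 5, 4]))  # 4.0
```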
Offers Alternatives to Other Mechanisms
The emergence of non-autoregressive decoding strategies that address inference latency marks a turning point in AI engineering.
Approaches like tree-based attention mechanisms have also tried to refine block drafts, and tree-structured prompting strategies have shown gains of their own. Tree-of-thought prompting, for instance, has been reported to reach 74% accuracy on a reasoning benchmark where chain-of-thought prompting managed only 4%.
However, these mechanisms often require intensive training or an abundance of inference data, making them impractical. The blockwise approach offers an effective solution for numerous use cases.
Provides Insights for Future Research
The n-gram and neural algorithms the researchers developed balance performance with efficiency, improving upon an already promising concept. Crucially, they offer insights into optimizing decoding techniques — even in the face of resource constraints.
Given context and a prompt, most large language models only predict the next token in a sequence. Most would benefit from blockwise parallel decoding and the two novel optimization algorithms.
Optimizing Model Performance With This Method
Whether you have a use for a non-autoregressive decoding strategy or not, you should be able to recognize the potential of performance optimization via parallel decoding. This research could serve as the basis for revolutionary strides in this domain.
Eleanor Hecks is a writer with 8+ years of experience contributing to publications like freeCodeCamp, Smashing Magazine, and Fast Company. You can find her work as Editor-in-Chief of Designerly Magazine, or keep up with her on LinkedIn.