Lightest
In Every Way
Light up your models with Lite AI,
the next-level AI that makes your models
lighter, faster, yet smarter.
The easiest path
to improve your AI
Problems We Solve
High-Spec Hardware Required
Due to Large Model Size
Service providers need high-specification hardware to run large LLMs, which is neither cost-effective nor scalable.
Cost Overrun Due to
One Large AI Model
Service providers usually process all requests through one large AI model, which leads to cost overruns and inefficiency.
Slower Inference Speed
Due to Larger Context Input Size
Inference slows down as the context input size increases, leading to user dissatisfaction.
Key Features
Lightweight AIs
on any device
We compress the original model into a much smaller one that can run on low-spec hardware, while maintaining performance requirements such as accuracy.
Patent: PA24054KR, 2024-06-18, DeepAuto.ai (provisional application)
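The compression pipeline behind this feature is patent-pending and not published, so the sketch below is only a generic illustration of the idea: shrinking a model's weights so it fits on low-spec hardware. It uses off-the-shelf PyTorch post-training dynamic quantization; the stand-in model, its layer sizes, and the size_mb helper are hypothetical and are not part of DeepAuto's product.

# Illustration only: generic post-training dynamic quantization with stock PyTorch.
# This is NOT DeepAuto's compression method; it just shows how int8 weights can
# cut a model's memory footprint so it runs on weaker hardware.
import io

import torch
import torch.nn as nn

# Hypothetical stand-in for "the original model".
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

# Quantize Linear weights from float32 to int8 (roughly 4x smaller), which keeps
# accuracy close to the original for many workloads.
compressed = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module) -> float:
    """Serialize the state dict in memory and report its size in megabytes."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"original:   {size_mb(model):.1f} MB")
print(f"compressed: {size_mb(compressed):.1f} MB")

Dynamic quantization is only one of several standard compression techniques (pruning, distillation, low-rank factorization); which combination preserves accuracy best depends on the model and workload.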
Right models
for right requests
We classify incoming requests with a query router and route each one to the optimal LLM that delivers the same quality, saving up to 95% of costs.
Patent: PA24053KR, 2024-06-18, DeepAuto.ai (provisional application)
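The query router itself is proprietary, so the sketch below is a minimal, hypothetical illustration of the routing idea: classify each request and dispatch it to the cheapest model tier expected to answer well enough. Every name here (ModelTier, MODEL_TIERS, route, the word-count predicates, and the cost figures) is invented for illustration and is not DeepAuto's API.

# Hypothetical request router: pick the cheapest model tier whose predicate
# accepts the query, so expensive models only see the requests that need them.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float          # illustrative prices
    accepts: Callable[[str], bool]     # crude stand-in for a learned query classifier

# Ordered from cheapest to most expensive; the router returns the first match.
MODEL_TIERS: List[ModelTier] = [
    ModelTier("small-compressed", 0.05, lambda q: len(q.split()) < 50),
    ModelTier("medium", 0.50, lambda q: len(q.split()) < 500),
    ModelTier("large", 10.00, lambda q: True),   # fallback: always capable
]

def route(query: str) -> ModelTier:
    """Return the cheapest tier willing to take the query."""
    for tier in MODEL_TIERS:
        if tier.accepts(query):
            return tier
    return MODEL_TIERS[-1]

print(route("What is 2 + 2?").name)   # -> small-compressed
print(route("word " * 600).name)      # -> large

In practice the per-tier predicate would be a trained classifier rather than a word count; the point is only that cheap requests never reach the most expensive model, which is where the cost savings come from.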
from torch import Tensor
from typing import Literal, Optional, Tuple
from hip import hip_attention
# NOTE: you have to scale Q before passing it to our kernel; HID is the head dimension.
scale = 1 / (HID ** 0.5)
# NOTE: we support fused RoPE with SelfExtend (https://github.com/datamllab/LongLM)
rope_method: Literal["none", "self_extend"] = "none"
# NOTE: you need to repeat or extend position_ids to match the number of heads.
position_ids: Optional[Tensor] = \
    position_ids.repeat_interleave(self.num_heads, 0) if rope_method != 'none' else None
"""
- q: Tensor[N*H, TDST, HID]
- k: Tensor[N*H, TSRC, HID]
- v: Tensor[N*H, TSRC, HID]
query, key, value of attention mechanism.
- mask_k: int,
same as $k$ in the paper
- block_size_q: int,
same as $b_q$ in the paper.
- block_size_k: int,
same as $b_k$ in the paper.
- dense_queries: int,
if the $T$ for the given query is shorter than this value, we
will use flash attention instead of ours.
- rope_method: Literal['none', 'self_extend'],
experimental setting to adopt Self-Extend LM paper. seems not
working well, so we did not report this.
- rope_cos, rope_sin, position_ids: Optional[Tensor],
please leave them as None unless you want to use Self-Extend LM
- self_extend_scale: int,
G1 in Self-Extend
- self_extend_window: int,
G2 in Self-Extend
"""
output, _ = hip_attention(
    q=q * scale,
    k=k,
    v=v,
    mask_k=512,
    block_size_q=32,
    block_size_k=2,
    dense_queries_exp=None if rope_method == 'none' else 0,
    rope_method=rope_method,
    rope_cos=None,              # leave as None unless using Self-Extend
    rope_sin=None,              # leave as None unless using Self-Extend
    position_ids=position_ids,
)
10 times faster inference
We provide an efficient serving-system algorithm that accelerates inference by 10 times.
Paper: Lee et al., HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning, DeepAuto.ai, arXiv preprint.
Optimize your growth.
Start free, scale big.