Lightest
In Every Way
Light up your models with Lite AI,
the next-level AI that makes your models
lighter, faster, yet smarter.
The easiest path
to improve your AI
Problems We Solve
High-Spec Hardware Required
Due to Large Model Size
Service providers need high-specification hardware to run large LLMs, which is neither cost-effective nor scalable.
Cost Overrun Due to
One Large AI Model
Service providers usually process all requests through one large AI model, which leads to cost overruns and inefficiency.
Slower Inference Speed
Due to Larger Context Input Size
Inference slows down as the context input size increases, leading to user dissatisfaction.
Key Features
Lightweight AIs
on any device
We compress the original model into a much smaller one that can run on low-spec hardware, while maintaining performance requirements such as accuracy.
Patent: PA24054KR, 2024-06-18, DeepAuto.ai (provisional application)
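The compression pipeline behind this feature is patent-pending and not published, so the sketch below is only a generic illustration of the idea: shrinking a model's weights so it fits on low-spec hardware. It uses off-the-shelf PyTorch post-training dynamic quantization; the stand-in model, its layer sizes, and the size_mb helper are hypothetical and are not part of DeepAuto's product.

# Illustration only: generic post-training dynamic quantization with stock PyTorch.
# This is NOT DeepAuto's compression method; it just shows how int8 weights can
# cut a model's memory footprint so it runs on weaker hardware.
import io

import torch
import torch.nn as nn

# Hypothetical stand-in for "the original model".
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

# Quantize Linear weights from float32 to int8 (roughly 4x smaller), which keeps
# accuracy close to the original for many workloads.
compressed = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module) -> float:
    """Serialize the state dict in memory and report its size in megabytes."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"original:   {size_mb(model):.1f} MB")
print(f"compressed: {size_mb(compressed):.1f} MB")

Dynamic quantization is only one of several standard compression techniques (pruning, distillation, low-rank factorization); which combination preserves accuracy best depends on the model and workload.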
Right models
for right requests
We classify incoming requests with a query router and route each one to the optimal LLM that delivers the same quality, saving up to 95% of costs.
Patent: PA24053KR, 2024-06-18, DeepAuto.ai (provisional application)
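The query router itself is proprietary, so the sketch below is a minimal, hypothetical illustration of the routing idea: classify each request and dispatch it to the cheapest model tier expected to answer well enough. Every name here (ModelTier, MODEL_TIERS, route, the word-count predicates, and the cost figures) is invented for illustration and is not DeepAuto's API.

# Hypothetical request router: pick the cheapest model tier whose predicate
# accepts the query, so expensive models only see the requests that need them.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float          # illustrative prices
    accepts: Callable[[str], bool]     # crude stand-in for a learned query classifier

# Ordered from cheapest to most expensive; the router returns the first match.
MODEL_TIERS: List[ModelTier] = [
    ModelTier("small-compressed", 0.05, lambda q: len(q.split()) < 50),
    ModelTier("medium", 0.50, lambda q: len(q.split()) < 500),
    ModelTier("large", 10.00, lambda q: True),   # fallback: always capable
]

def route(query: str) -> ModelTier:
    """Return the cheapest tier willing to take the query."""
    for tier in MODEL_TIERS:
        if tier.accepts(query):
            return tier
    return MODEL_TIERS[-1]

print(route("What is 2 + 2?").name)   # -> small-compressed
print(route("word " * 600).name)      # -> large

In practice the per-tier predicate would be a trained classifier rather than a word count; the point is only that cheap requests never reach the most expensive model, which is where the cost savings come from.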
from torch import Tensor
from typing import Literal, Optional, Tuple
from hip import hip_attention
# NOTE: you have to scale Q before passing it to our kernel; HID is the head dimension.
scale = 1 / (HID ** 0.5)
# NOTE: we support fused RoPE with SelfExtend (https://github.com/datamllab/LongLM)
rope_method: Literal["none", "self_extend"] = "none"
# NOTE: you need to repeat or extend position_ids to match the number of heads.
position_ids: Optional[Tensor] = \
    position_ids.repeat_interleave(self.num_heads, 0) if rope_method != 'none' else None
"""
- q: Tensor[N*H, TDST, HID]
- k: Tensor[N*H, TSRC, HID]
- v: Tensor[N*H, TSRC, HID]
query, key, value of attention mechanism.
- mask_k: int,
same as $k$ in the paper
- block_size_q: int,
same as $b_q$ in the paper.
- block_size_k: int,
same as $b_k$ in the paper.
- dense_queries: int,
if the $T$ for the given query is shorter than this value, we
will use flash attention instead of ours.
- rope_method: Literal['none', 'self_extend'],
experimental setting to adopt Self-Extend LM paper. seems not
working well, so we did not report this.
- rope_cos, rope_sin, position_ids: Optional[Tensor],
please leave them as None unless you want to use Self-Extend LM
- self_extend_scale: int,
G1 in Self-Extend
- self_extend_window: int,
G2 in Self-Extend
"""
output, _ = hip_attention(
    q=q * scale,
    k=k,
    v=v,
    mask_k=512,
    block_size_q=32,
    block_size_k=2,
    dense_queries_exp=None if rope_method == 'none' else 0,
    rope_method=rope_method,
    rope_cos=None,              # leave as None unless using Self-Extend
    rope_sin=None,              # leave as None unless using Self-Extend
    position_ids=position_ids,
)
10 times faster inference
We provide an efficient serving-system algorithm that accelerates inference by 10 times.
Paper: Lee et al., HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning, DeepAuto.ai, arXiv preprint.
Optimize your growth.
Start free, scale big.