Lightest
In Every Way

Light up your models with Lite AI, the next-level AI that makes your models lighter, faster, yet smarter.

Apply For Waitlist
  • StradVision Logo
  • LG Electronics Logo
  • Cheil Logo
  • KAIST Logo

The easiest path
to improve your AI

Problems We Solve

High-Spec Hardware Required
Due to Large LLM Sizes

Service providers need high-specification hardware to run large LLMs, which is neither cost-effective nor scalable.

Cost Overrun Due to
One Large AI Model

Service providers usually process all requests through one large AI model, which leads to cost overrun and inefficiency.

Slower Inference Speed
Due to Larger Context Input Size

Inference speed slows down as the context input size increases, leading to increased user dissatisfaction.

Key Features

Model Compression

Lightweight AIs
on any device

We compress the original model into a much smaller one that can run on low-spec hardware, while maintaining performance requirements such as accuracy.

Patent: PA24054KR 20240618 DeepAuto.ai (Provisional Application)
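For intuition only (this is not DeepAuto.ai's patented compression method), the sketch below applies off-the-shelf post-training dynamic quantization in PyTorch to a toy model, shrinking its weight footprint while keeping the same interface; the layer sizes are arbitrary assumptions.

import torch

# Toy stand-in model with arbitrary layer sizes (illustration only; not the
# patented compression pipeline).
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
)

# Post-training dynamic quantization: Linear weights are stored in int8 and
# dequantized on the fly, cutting weight memory roughly 4x versus float32.
compressed = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
print(compressed(x).shape)  # same interface, much smaller weights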

Query Routing

Right models
for right requests

We classify incoming requests with query routers and send each one to the optimal LLM that delivers the same quality, saving up to 95% of cost.

Patent: PA24053KR 20240618 DeepAuto.ai (Provisional Application)
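As a rough illustration of the routing idea (not the patented router), the sketch below scores each request with a toy difficulty heuristic and dispatches easy queries to a small, cheap model and hard ones to a large model; the model names, prices, and heuristic are all illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ModelEndpoint:
    name: str
    cost_per_1k_tokens: float  # illustrative pricing, not real quotes

# Hypothetical pool of routable models (names are placeholders).
SMALL = ModelEndpoint("small-llm", 0.0002)
LARGE = ModelEndpoint("large-llm", 0.01)

def score_difficulty(query: str) -> float:
    # Toy heuristic: longer, more analytical queries score higher.
    # A production router would use a trained classifier instead.
    signals = ["prove", "derive", "multi-step", "code", "analyze"]
    return min(1.0, len(query) / 2000 + 0.2 * sum(s in query.lower() for s in signals))

def route(query: str, threshold: float = 0.5) -> ModelEndpoint:
    # Easy requests go to the cheap model; hard ones to the large model.
    return LARGE if score_difficulty(query) >= threshold else SMALL

print(route("What is the capital of France?").name)                      # -> small-llm
print(route("Analyze this code and prove its complexity bound.").name)   # -> large-llm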

GPT-4o · Claude-3.5 · Qwen-2 · LLaMA-3.1
from torch import Tensor
from typing import Literal, Optional, Tuple
from hip import hip_attention

# NOTE: you have to scale Q before passing it to our kernel
scale = 1 / (HID ** 0.5)
# NOTE: we support fused RoPE with Self-Extend (https://github.com/datamllab/LongLM)
rope_method: Literal["none", "self_extend"] = "none"
# NOTE: you need to repeat or extend position_ids to match the number of heads.
position_ids: Optional[Tensor] = \
    position_ids.repeat_interleave(self.num_heads, 0) if rope_method != 'none' else None
"""
- q: Tensor[N*H, TDST, HID]
- k: Tensor[N*H, TSRC, HID]
- v: Tensor[N*H, TSRC, HID]
query, key, value of attention mechanism.
- mask_k: int,
same as $k$ in the paper
- block_size_q: int,
same as $b_q$ in the paper.
- block_size_k: int,
same as $b_k$ in the paper.
- dense_queries: int,
if the $T$ for the given query is shorter than this value, we
will use flash attention instead of ours.
- rope_method: Literal['none', 'self_extend'],
experimental setting adopting the Self-Extend LM paper; it does not seem
to work well, so we did not report it.

- rope_cos, rope_sin, position_ids: Optional[Tensor],
please leave them as None unless you want to use Self-Extend LM
- self_extend_scale: int,
G1 in Self-Extend
- self_extend_window: int,
G2 in Self-Extend
"""

# The remaining keyword arguments follow the docstring above; the RoPE-related
# ones stay None unless rope_method='self_extend' (in that case also set
# self_extend_scale and self_extend_window, i.e. G1 and G2).
output, _ = hip_attention(
    q=q * scale,
    k=k,
    v=v,
    mask_k=512,
    block_size_q=32,
    block_size_k=2,
    dense_queries_exp=None if rope_method == 'none' else 0,
    rope_method=rope_method,
    rope_cos=None,
    rope_sin=None,
    position_ids=position_ids,
)
Serving System

10 times faster inference

We provide an efficient serving-system algorithm that accelerates inference speed by 10 times.

Paper: Lee et al., "HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning", DeepAuto.ai, arXiv preprint
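To make the snippet above concrete, a minimal end-to-end call could look like the following; the tensor sizes are arbitrary, the shapes follow the [N*H, T, HID] convention documented in the snippet, and we assume the remaining keyword arguments shown above can be left at those values.

import torch
from hip import hip_attention

# Arbitrary example sizes; shapes follow the [N*H, T, HID] convention above.
N, H, TDST, TSRC, HID = 1, 8, 1024, 1024, 64

q = torch.randn(N * H, TDST, HID, device='cuda', dtype=torch.float16)
k = torch.randn(N * H, TSRC, HID, device='cuda', dtype=torch.float16)
v = torch.randn(N * H, TSRC, HID, device='cuda', dtype=torch.float16)

scale = 1 / (HID ** 0.5)  # scale Q before passing it to the kernel
output, _ = hip_attention(
    q=q * scale, k=k, v=v,
    mask_k=512, block_size_q=32, block_size_k=2,
)
print(output.shape)  # [N*H, TDST, HID]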

AI for AIs

Lite AI leverages world-class expertise in AI and machine learning from MLAI at KAIST to deliver the most advanced AI solutions for your business.

Apply For Waitlist