1 Commit

Author SHA1 Message Date
cf351be434 Update markdown example: add RUG Rust unit-test generation demo content
- Changed theme to black for better readability
- Replaced markdown.md with the RUG paper presentation
- Added the related images to the images directory
- Set the presentation size to 1920x1080 to suit modern displays
- Removed the original example slides to focus on the academic presentation
2025-11-25 10:02:44 +08:00
8 changed files with 94 additions and 114 deletions

6 binary files (new images) not shown. Sizes after: 28 KiB, 111 KiB, 92 KiB, 79 KiB, 36 KiB, 40 KiB.

View File

@@ -7,7 +7,7 @@
 <title>reveal.js - Markdown Example</title>
 <link rel="stylesheet" href="../dist/reveal.css">
-<link rel="stylesheet" href="../dist/theme/white.css" id="theme">
+<link rel="stylesheet" href="../dist/theme/black.css" id="theme">
 <link rel="stylesheet" href="../plugin/highlight/monokai.css">
 </head>
@@ -19,7 +19,7 @@
 <div class="slides">
 <!-- Use external markdown resource, separate slides by three newlines; vertical slides by two newlines -->
-<section style="text-align: left;" data-markdown="markdown.md" data-separator="---" data-separator-vertical="^\n\n"></section>
+<section data-markdown="markdown.md" data-separator="^\n\n\n" data-separator-vertical="^\n\n"></section>
 </div>
 </div>

View File

@@ -1,153 +1,133 @@

Before (removed):

# JSONite: High-Performance Embedded Database for Semi-Structured Data
---
## The JSON Performance Crisis
**JSON is Everywhere:**
- Web APIs, IoT, logs, configurations
- Semi-structured, flexible, human-readable

**But Current Solutions Fail:**
- **Large Databases**: People use MongoDB or PostgreSQL's JSONB to store data
- **Embedded Databases**: RocksDB and PoloDB lack ACID and SQL support
- **Serialization to String**: Or serialize JSON into strings and store them in SQLite

Serialized JSON with SQL example
```sql
insert into http_request_log (ip, headers)
values ('127.0.0.1', '{
  "Content-Type": "application/oct-stream",
  "X-Forwarded-For": "100.64.0.1",
}');
```
---
## Introducing JSONite
**Best of Both Worlds:**
- SQLite-based
- Native JSON optimization

**Key Advantages:**
- ✅ ACID compliance
- ✅ SQL simplicity
- ✅ Serverless C library
- ✅ Lightning-fast JSON access
---
## Smart Key Optimization
**Key Sorting by Length:**
```
{
  "id": 1,
  "address": {...},
  "name": "John",
  "email": "john@example.com",
}
```
**Sorted as:**
```
{
  "id",      (2 chars)
  "name",    (4 chars)
  "email",   (5 chars)
  "address", (7 chars)
}
```
**Binary search on length → Fast lookups**
---
## Handling Massive Data: Smart TOAST
**The Oversized-Attribute Storage Technique**
- Standard approach: arbitrary chunking
- JSONite's innovation: **Data-Type Aware TOAST**

**Intelligent Chunking:**
- Arrays split between elements
- Objects split between key-value pairs
- Text falls back to fixed chunks

**Enables "Slice Detoasting":**
- `$.logs[1000000:1000010]` fetches only 10 elements
- Not the entire multi-gigabyte array

Smart Chunking Example
```json
{
  "id": 1,
  "title": "some text",
  "html": <pointer to TOAST of 200k text>,
  "photos": [<pointer to TOAST of binary data>],
  "crawl_logs": [<pointer to TOAST of array of texts>]
}
```
---
## Query Power
**Full SQL + JSON Support:**
- PostgreSQL-compatible JSONB path operators
- GIN indexes for instant search
```sql
SELECT *
FROM accounts
WHERE data @> '{"status": "active"}'
```
---
## Performance Validation: Benchmark Datasets
**Three Specialized Workloads:**
1. **YCSB-Style Read Benchmark**
   - Yahoo! Cloud Serving Benchmark
   - 1M JSON documents (1KB-100KB each)
2. **TPC-C Inspired Update Benchmark**
   - Transaction Processing Performance Council
   - 100K transactional JSON records
   - Frequent small field updates
3. **Large-Array Slice Benchmark**
   - Multi-gigabyte JSON documents
   - Massive arrays (10M+ elements)

**Comparison Targets:** SQLite JSONB vs MongoDB vs PostgreSQL vs JSONite
---
## JSONite: The Future of Embedded Data Storage

After (added):

# RUG: Turbo LLM for Rust Unit Test Generation
Keywords: LLM, Rust, Unit Test
Research date: 2022; published: 2025


#### Introduction

* Unit testing is crucial but costly.
* Rust's strict type system makes generated tests hard to compile.
* Existing LLM approaches often fail.


#### Rust Unit Test
```rust
/// Returns the sum of two numbers
///
/// # Examples
///
/// ```
/// assert_eq!(add(2, 3), 5);
/// assert_eq!(add(-1, 1), 0);
/// ```
fn add(a: i32, b: i32) -> i32 {
    a + b
}
```


#### Challenge
```rust
fn encode<E: Encoder>(&self, encoder: E) -> Result<(), EncodeError> // target function (self: char)

impl<W: Writer, C: Config> Encoder for EncoderImpl
pub struct EncoderImpl<W: Writer, C: Config>
impl Writer for SliceWriter
impl Writer for IoWriter
impl<T> Config for T where T: R1 + R2 + R3
pub struct Configuration<R1, R2, R3>
```
Simplified Python version
```python
import sys

def encode(char_data, encoder):
    return encoder.process(char_data)

class Encoder:
    def __init__(self, writer, config):
        self.writer = writer  # was missing in the slide version
        self.config = config
    def process(self, data):
        return self.writer.write(data, self.config)

class Writer:  # wrapper added so the example runs; mirrors the Writer trait
    def __init__(self, stream):
        self.stream = stream
    def write(self, data, config):
        return self.stream.write(data)

class Config:
    def __init__(self):
        self.settings = {}

config = Config()
encoder = Encoder(Writer(sys.stdout), config)

# Test code
result = encode('A', encoder)
```
LLM-generated code often fails to pass the compiler.


#### RUG design

<img src="./images/Screenshot_20251125_010053.jpeg" width="75%">

<img src="./images/Screenshot_20251125_011029.jpeg" width="80%">

<img src="./images/Screenshot_20251125_011348.jpeg" width="80%">


#### Implementation
- gpt-3.5-turbo-16k-0613
- gpt-4-1106
- presence_penalty set to -1
- frequency_penalty set to 0.5
- temperature set to 1 (the default)
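The sampling setup listed above can be sketched as a request payload. This is illustrative only: `build_request` is a hypothetical helper, not RUG's actual code; only the model names and parameter values come from the slides.

```python
# Hypothetical sketch of the decoding configuration listed above.
# Only the model names and parameter values come from the slides.
def build_request(model: str, prompt: str) -> dict:
    return {
        "model": model,            # e.g. "gpt-4-1106" or "gpt-3.5-turbo-16k-0613"
        "messages": [{"role": "user", "content": prompt}],
        "presence_penalty": -1,    # negative value favors tokens already present
        "frequency_penalty": 0.5,  # damps verbatim repetition
        "temperature": 1,          # the API default
    }

req = build_request("gpt-4-1106", "Generate a unit test for `add`.")
```

A negative presence penalty nudges the model toward reusing identifiers it has already emitted, which helps keep generated test code consistent with the names in the prompt.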

Before (removed, continued):

**Why It Matters Today:**
- **Edge Computing**: Lightweight, handles sensor data efficiently
- **Modern Apps**: SQL power + JSON flexibility, no schema migrations

**The Vision:**
- Open source implementation
- Community-driven development
- Becoming the default choice for embedded JSON storage
- Bridging SQL reliability with NoSQL flexibility
---
## Thank You
**Questions?**
*CHEN Yongyuan*
*2025-11-01*

After (added, continued):

#### Eval: Comparison with Traditional Tools
<img src="./images/Screenshot_20251125_014355.jpeg" width="75%">


#### Token Consumption
- GPT-4 cost about $1,000 with the baseline method (which sends the whole context every time)
- RUG saved 51.3% of tokens (each unique dependency is processed only once)
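The saving comes from not re-processing shared dependency context. A minimal memoization sketch of that idea (illustrative; `make_context_builder` and `summarize` are hypothetical names, not RUG's API):

```python
# Illustrative sketch: each unique dependency is summarized once and the
# cached summary is reused across test-generation requests.
def make_context_builder(summarize):
    cache = {}
    def context_for(dependencies):
        parts = []
        for dep in dependencies:
            if dep not in cache:
                cache[dep] = summarize(dep)  # paid for only once per dependency
            parts.append(cache[dep])
        return "\n".join(parts)
    return context_for

calls = []
def summarize(dep):
    calls.append(dep)  # track how often the expensive step runs
    return f"summary of {dep}"

build_context = make_context_builder(summarize)
build_context(["Encoder", "Writer"])
build_context(["Writer", "Config"])  # "Writer" is served from the cache
# summarize ran once each for Encoder, Writer, Config
```

In RUG the expensive step is sending a dependency's code to the model, so deduplicating it directly cuts token spend on targets that share dependencies.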
#### Real-World Usability
> We directly leverage RUG's generated tests, without changing the test bodies, and send them as PRs to the open-source projects.
> To our surprise, the developers are happy to merge these machine-generated tests.
> RUG generated a total of 248 unit tests, of which we submitted 113 to the corresponding crates based on their quality and priority.
> So far, 53 of these unit tests have been merged with positive feedback.
> Developers chose not to merge 17 tests for two main reasons:
> first, the target functions are imported from external libraries (16),
> and the developers do not intend to include tests
#### 2025 Situation
<img src="./images/Screenshot_20251125_015416.jpeg" width="60%">
<img src="./images/Screenshot_20251125_015705.jpeg" width="60%">