Notes on building an agentic system for research
How to build a domain-specific knowledge base:
We don’t need LlamaParse! arXiv has the raw TeX source files for most papers! Those can be used to fine-tune a model!
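A minimal sketch of pulling those sources, assuming the arXiv e-print endpoint (https://arxiv.org/e-print/<id>) serves the submission as a gzipped tar archive (some older submissions are a single gzipped file, which this sketch skips); the paper ID is a placeholder, and bulk collection should respect arXiv's rate limits or use their bulk data access:
import io
import tarfile
import requests

def download_tex_sources(arxiv_id):
    # Fetch the source tarball for one paper from arXiv's e-print endpoint
    url = f"https://arxiv.org/e-print/{arxiv_id}"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    tex_files = {}
    with tarfile.open(fileobj=io.BytesIO(resp.content), mode="r:gz") as tar:
        for member in tar.getmembers():
            # Collect every .tex file in the submission
            if member.isfile() and member.name.endswith(".tex"):
                tex_files[member.name] = tar.extractfile(member).read().decode("utf-8", errors="ignore")
    return tex_files

# Placeholder paper ID for illustration
for name, tex in download_tex_sources("2101.00001").items():
    print(name, len(tex), "characters")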
Notes on Grammar checking
Many tasks in an academic lab can be broken down into smaller pieces digestible by local LLMs like 70B Llama.
1) Grammar checking example: local agents can be asked to check the grammar sentence by sentence, as in the script below.
# First, install Ollama and run it so it can download the models
from ollama import chat
import re

def check_grammar(sentences):
    # Ask the local model to correct each sentence, one at a time
    for sent in sentences:
        prompt = f"""You are a grammar checker for the LaTeX draft of an academic paper. Correct the following sentence for obvious grammar mistakes.
You are not supposed to add quotation marks or modify other LaTeX commands. Only return the corrected sentence without explanations or additional text.
Sentence: "{sent}"
Return your answer on a new line, beginning with "Corrected:" """
        response = chat(
            model='llama3.1',  # or any other model you prefer
            # model='deepseek-r1',
            messages=[{'role': 'user', 'content': prompt}],
        )
        # Extract the corrected text from the response
        corrected_text = response['message']['content'].strip()
        yield corrected_text

def extract_sentences(latex_file):
    with open(latex_file, 'r', encoding='utf-8') as f:
        text = f.read()
    # Split into sentences, strip leading whitespace, and skip the first two sentences
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text)][2:]
    return sentences

def process_latex_file(latex_file):
    sentences = extract_sentences(latex_file)
    for original, corrected in zip(sentences, check_grammar(sentences)):
        # First remove the thinking process (everything between and including <think> tags)
        corrected = re.sub(r'<think>[\s\S]*?</think>\s*', '', corrected)
        # Then extract only what comes after "Corrected:"
        corrected = re.sub(r'^.*?Corrected:\s*', '', corrected)
        # Only display sentences that were changed
        if original != corrected:
            print(f"old:\n{original}\nnew:\n{corrected}\n\n")

latex_file = "main.tex"
process_latex_file(latex_file)
Notes on paper polishing: logic checking, proof-reading:
TODO:
Big picture:
LLMs will help academic research labs in two ways:
1) Pulling knowledge from an indexed proprietary knowledge base and helping humans make decisions - This requires lab members to document their knowledge well, and is a natural extension of lab members organizing their code into code packages / manuals.
2) Using proprietary knowledge and multi-agent systems to build and implement experiments, then write paper drafts.
Agent teams:
1) Administrator team: - Represents the human user, posing questions and approving plans. - Greets the human user
2) Science team: - Superconducting qubit specialist - Quantum error correction specialist
3) Programming team: - Program user: uses the code for research - Architect: codebase organization - API writer: writes requirements - SDE: implements requirements - Data analyst:
4) Writing team: - Paper writer: writes the paper in a tone that matches PhD students.
An agent is defined by
1) Long-term memory/Knowledge base for RAG: this augments an agent beyond what’s trained into it. This can be implemented by using LlamaParse to break down scientific papers, or by indexing existing proprietary / commonly used codebases (see the retrieval sketch after this list).
2) Action space: the action space of different agents can be defined using existing multi-agent frameworks like AutoGen (see the AutoGen sketch after this list).
3) Decision making
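A minimal sketch of the retrieval piece of 1), assuming the ollama Python package exposes an embeddings() call and that an embedding model such as nomic-embed-text has been pulled locally; the document chunks are hypothetical stand-ins for parsed lab manuals or codebase docs:
import numpy as np
from ollama import chat, embeddings

def embed(text):
    # Embed a chunk of lab documentation with a local embedding model
    return np.array(embeddings(model='nomic-embed-text', prompt=text)['embedding'])

# Hypothetical chunks standing in for parsed papers, lab manuals, or codebase docs
chunks = [
    "Qubit T1 is measured with the qubit_T1() routine in the lab's measurement package.",
    "Readout calibration is documented in the resonator punch-out notebook.",
]
chunk_vectors = [embed(c) for c in chunks]

def answer(question, top_k=1):
    # Retrieve the most similar chunks by cosine similarity, then answer with the local model
    q = embed(question)
    scores = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in chunk_vectors]
    context = "\n".join(chunks[i] for i in np.argsort(scores)[::-1][:top_k])
    prompt = f"Answer using only this lab documentation:\n{context}\n\nQuestion: {question}"
    return chat(model='llama3.1', messages=[{'role': 'user', 'content': prompt}])['message']['content']

print(answer("How do we measure qubit T1?"))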
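A minimal sketch of the action-space piece of 2), assuming AutoGen's (pyautogen) AssistantAgent / UserProxyAgent API and Ollama's OpenAI-compatible endpoint; the model name, system message, and endpoint are placeholders:
from autogen import AssistantAgent, UserProxyAgent

# Placeholder config pointing at Ollama's OpenAI-compatible endpoint
llm_config = {"config_list": [{
    "model": "llama3.1",
    "base_url": "http://localhost:11434/v1",
    "api_key": "ollama",  # Ollama ignores the key, but the field is required
}]}

# A science-team specialist whose role is set by its system message
qec_specialist = AssistantAgent(
    name="qec_specialist",
    system_message="You are a quantum error correction specialist. Propose and critique experiment plans.",
    llm_config=llm_config,
)

# The administrator agent that represents the human user and asks for approval at each step
admin = UserProxyAgent(
    name="admin",
    human_input_mode="ALWAYS",
    code_execution_config=False,  # no code execution in this sketch
)

admin.initiate_chat(qec_specialist, message="Draft a plan for benchmarking a distance-3 surface code.")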