返回 Feed

Build a Multi-Agent AI Workflow for Biological Network Modeling, Protein Interactions, Metabolism, and Cell Signaling Simulation

RSS·MarkTechPost·Asif Razzaq·2 天前
查看原文

AI 分析摘要

8AI 工作流

本教程介紹了一個多代理人 AI 工作流程,適用於生物網路建模、蛋白質相互作用、代謝和細胞信號模擬。讀者可以使用 OpenAI 模型作為主要調查員,整合各個代理人的輸出,生成一個專家風格的生物學解釋。這個工作流程可以在 Colab 環境中實現,具有實用性和可複製性。

💡 評分理由:提供了一個具體可用的工作流程和可直接複製的代碼,具有實質功能變化和參考價值。

模型:meta/llama-3.3-70b-instruct · Prompt: v2

原文內容(清洗後)

In this tutorial, we build a multi-agent workflow for biological systems modeling and explore how different computational components work together inside one unified systems biology pipeline. We generate synthetic biological data, analyze gene regulatory structure, predict protein-protein interactions, optimize metabolic pathway activity, and simulate a dynamic cell signaling cascade, all within a Colab environment that remains practical and reproducible. We also use an OpenAI model to act as a principal investigator, synthesizing the outputs of all specialized agents into a single expert-style biological interpretation that connects regulation, interaction networks, metabolism, and signaling into a broader scientific story. Copy Code Copied Use a different Browser import sys, subprocess, pkgutil def _install_if_missing(packages): missing = [] for p in packages: import_name = p["import"] if pkgutil.find_loader(import_name) is None: missing.append(p["pip"]) if missing: print("Installing:", ", ".join(missing)) subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + missing) _install_if_missing([ {"pip": "openai", "import": "openai"}, {"pip": "numpy", "import": "numpy"}, {"pip": "pandas", "import": "pandas"}, {"pip": "matplotlib", "import": "matplotlib"}, {"pip": "networkx", "import": "networkx"}, {"pip": "scikit-learn", "import": "sklearn"}, ]) import os import json import math import textwrap import random import getpass from dataclasses import dataclass from typing import Dict, List, Tuple, Any import numpy as np import pandas as pd import matplotlib.pyplot as plt import networkx as nx from sklearn.linear_model import LogisticRegression from sklearn.metrics import roc_auc_score, average_precision_score from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from openai import OpenAI np.random.seed(42) random.seed(42) OPENAI_API_KEY = None try: from google.colab import userdata OPENAI_API_KEY = userdata.get("OPENAI_API_KEY") if OPENAI_API_KEY: print("Loaded OPENAI_API_KEY from Colab Secrets.") except Exception: pass if not OPENAI_API_KEY: try: OPENAI_API_KEY = getpass.getpass("Enter OPENAI_API_KEY (hidden input): ").strip() except Exception: OPENAI_API_KEY = input("Enter OPENAI_API_KEY: ").strip() os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY client = OpenAI(api_key=OPENAI_API_KEY) OPENAI_MODEL = "gpt-4o-mini" We prepare the Colab environment and make sure all required libraries are available before the workflow begins. We import the scientific computing, machine learning, graph analysis, plotting, and OpenAI libraries that support the full biological systems pipeline from start to finish. We also securely load the OpenAI API key either from Colab Secrets or hidden input, initialize the client, and define the model so the notebook is ready for later LLM-based synthesis. Copy Code Copied Use a different Browser def sigmoid(x): return 1 / (1 + np.exp(-x)) def pretty(title: str, body: str, width: int = 100): print("\n" + "=" * width) print(title) print("=" * width) print(body) def safe_float(x): try: return float(x) except Exception: return None def generate_gene_regulatory_network(n_genes: int = 14, edge_prob: float = 0.18): genes = [f"G{i+1}" for i in range(n_genes)] W = np.zeros((n_genes, n_genes)) for i in range(n_genes): for j in range(n_genes): if i != j and np.random.rand() < edge_prob: W[i, j] = np.random.uniform(-1.5, 1.5) return genes, W def simulate_gene_expression(W: np.ndarray, n_steps: int = 70, noise: float = 0.10): n = W.shape[0] X = np.zeros((n_steps, n)) X[0] = np.random.uniform(0.2, 0.8, size=n) for t in range(1, n_steps): signal = X[t-1] @ W X[t] = sigmoid(signal + np.random.normal(0, noise, size=n)) return X def generate_protein_features(n_proteins: int = 40, feature_dim: int = 10): proteins = [f"P{i+1}" for i in range(n_proteins)] features = np.random.normal(size=(n_proteins, feature_dim)) families = np.random.randint(0, 5, size=n_proteins) localization = np.random.randint(0, 4, size=n_proteins) return proteins, features, families, localization def generate_ppi_dataset(proteins, features, families, localization): rows = [] n = len(proteins) hidden_w = np.random.normal(size=features.shape[1]) for i in range(n): for j in range(i + 1, n): fi, fj = features[i], features[j] sim = np.dot(fi, fj) / (np.linalg.norm(fi) * np.linalg.norm(fj) + 1e-8) fam_same = 1 if families[i] == families[j] else 0 loc_same = 1 if localization[i] == localization[j] else 0 feat = np.concatenate([ np.abs(fi - fj), fi * fj, [sim, fam_same, loc_same] ]) score = 1.4 * sim + 1.0 * fam_same + 0.8 * loc_same + 0.15 * np.dot((fi + fj) / 2, hidden_w) prob = sigmoid(score) y = 1 if np.random.rand() < prob else 0 rows.append((proteins[i], proteins[j], feat, y)) return rows def generate_metabolic_network(): metabolites = ["Glucose", "Pyruvate", "AcetylCoA", "ATP", "Biomass", "Lactate", "Ethanol"] reactions = [ {"name": "R1_Glucose_Uptake", "yield_biomass": 0.0, "yield_atp": 0.3, "substrate_cost": 1.0, "oxygen_need": 0.0}, {"name": "R2_Glycolysis", "yield_biomass": 0.2, "yield_atp": 1.6, "substrate_cost": 0.7, "oxygen_need": 0.0}, {"name": "R3_TCA", "yield_biomass": 1.0, "yield_atp": 2.4, "substrate_cost": 0.8, "oxygen_need": 1.4}, {"name": "R4_Fermentation", "yield_biomass": 0.1, "yield_atp": 0.9, "substrate_cost": 0.4, "oxygen_need": 0.0}, {"name": "R5_Ethanol_Path", "yield_biomass": 0.15,"yield_atp": 0.8, "substrate_cost": 0.5, "oxygen_need": 0.0}, {"name": "R6_Biomass_Assembly","yield_biomass": 1.3, "yield_atp": -0.9,"substrate_cost": 0.6, "oxygen_need": 0.2}, ] return metabolites, reactions def simulate_cell_signaling(T=200, dt=0.05, ligand_level=1.2): t = np.arange(0, T * dt, dt) ligand = np.ones_like(t) * ligand_level receptor = np.zeros_like(t) kinase = np.zeros_like(t) tf = np.zeros_like(t) phosphatase = np.zeros_like(t) receptor[0] = 0.05 kinase[0] = 0.02 tf[0] = 0.01 phosphatase[0] = 0.30 for i in range(1, len(t)): dR = 1.6 * ligand[i-1] * (1 - receptor[i-1]) - 0.9 * receptor[i-1] dK = 1.8 * receptor[i-1] * (1 - kinase[i-1]) - 1.1 * phosphatase[i-1] * kinase[i-1] dTF = 1.4 * kinase[i-1] * (1 - tf[i-1]) - 0.55 * tf[i-1] dP = 0.2 + 0.5 * tf[i-1] - 0.4 * phosphatase[i-1] receptor[i] = np.clip(receptor[i-1] + dt * dR, 0, 1) kinase[i] = np.clip(kinase[i-1] + dt * dK, 0, 1) tf[i] = np.clip(tf[i-1] + dt * dTF, 0, 1) phosphatase[i] = np.clip(phosphatase[i-1] + dt * dP, 0, 1.5) return pd.DataFrame({ "time": t, "ligand": ligand, "receptor_active": receptor, "kinase_active": kinase, "tf_active": tf, "phosphatase": phosphatase, }) We define the main helper utilities and all synthetic data generation functions that power the notebook s biological tasks. We create functions for gene regulatory network construction, gene expression simulation, protein feature generation, protein interaction dataset creation, metabolic network setup, and cell signaling dynamics, which together provide four distinct biological views for analysis. This snippet forms the computational backbone of the tutorial by creating the structured inputs that each specialized agent will later process and interpret. Copy Code Copied Use a different Browser @dataclass class AgentResult: name: str summary: Dict[str, Any] class GeneRegulatoryNetworkAgent: def run(self, genes, W, X) -> AgentResult: corr = np.corrcoef(X.T) inferred_edges = [] true_edges = [] n = len(genes) for i in range(n): for j in range(n): if i == j: continue if abs(corr[i, j]) > 0.35: inferred_edges.append((genes[i], genes[j], float(corr[i, j]))) if abs(W[i, j]) > 1e-8: true_edges.append((genes[i], genes[j], float(W[i, j]))) centrality_graph = nx.DiGraph() for gi in genes: centrality_graph.add_node(gi) for i in range(n): for j in range(n): if abs(W[i, j]) > 1e-8: centrality_graph.add_edge(genes[i], genes[j], weight=float(W[i, j])) out_deg = dict(centrality_graph.out_degree()) in_deg = dict(centrality_graph.in_degree()) hubs = sorted(out_deg.items(), key=lambda x: x[1], reverse=True)[:5] sinks = sorted(in_deg.items(), key=lambda x: x[1], reverse=True)[:5] dynamic_var = X.var(axis=0) most_dynamic = sorted(zip(genes, dynamic_var), key=lambda x: x[1], reverse=True)[:5] summary = { "num_genes": n, "num_true_regulatory_edges": len(true_edges), "num_inferred_associations": len(inferred_edges), "top_hub_genes": [{"gene": g, "out_degree": int(d)} for g, d in hubs], "top_sink_genes": [{"gene": g, "in_degree": int(d)} for g, d in sinks], "most_dynamic_genes": [{"gene": g, "variance": round(float(v), 4)} for g, v in most_dynamic], "sample_inferred_edges": [ {"source": a, "target": b, "association": round(c, 3)} for a, b, c in inferred_edges[:10] ], "expression_tail_mean": round(float(X[-10:].mean()), 4), } return AgentResult(name="GeneRegulatoryNetworkAgent", summary=summary) We define the shared result container and the gene regulatory network agent that analyzes regulatory structure and expression behavior. We use correlation-based association inference, true-edge extraction, degree-based graph analysis, and variance-based ranking to identify hub, sink, and highly dynamic genes across simulated expression trajectories. This gives us a network-level picture of how regulatory influence may be distributed in the system and helps us identify important candidate regulators for downstream interpretation. Copy Code Copied Use a different Browser class ProteinInteractionPredictionAgent: def run(self, ppi_rows) -> AgentResult: X = np.vstack([r[2] for r in ppi_rows]) y = np.array([r[3] for r in ppi_rows]) scaler = StandardScaler() X_train_s = scaler.fit_transform(X_train) X_test_s = scaler.transform(X_test) clf = LogisticRegression(max_iter=1000) clf.fit(X_train_s, y_train) probs = clf.predict_proba(X_test_s)[:, 1] auc = roc_auc_score(y_test, probs) ap = average_precision_score(y_test, probs) scored_pairs = [] Xt_full = scaler.transform(X) full_probs = clf.predict_proba(Xt_full)[:, 1] for (p1, p2, _, label), pr in zip(ppi_rows, full_probs): scored_pairs.append((p1, p2, float(pr), int(label))) top_candidates = sorted(scored_pairs, key=lambda x: x[2], reverse=True)[:10] positive_rate = float(y.mean()) summary = { "num_pairs": int(len(ppi_rows)), "positive_interaction_rate": round(positive_rate, 4), "test_roc_auc": round(float(auc), 4), "test_average_precision": round(float(ap), 4), "top_predicted_interactions": [ {"protein_a": a, "protein_b": b, "pred_prob": round(pr, 4), "label": lab} for a, b, pr, lab in top_candidates ], } return AgentResult(name="ProteinInteractionPredictionAgent", summary=summary) class MetabolicOptimizationAgent: def run(self, reactions, oxygen_budget=3.5, substrate_budget=4.0): best_score = -1e9 best_flux = None trace = [] for _ in range(8000): flux = np.random.dirichlet(np.ones(len(reactions))) * np.random.uniform(1.5, 5.0) oxygen = sum(f["oxygen_need"] * v for f, v in zip(reactions, flux)) substrate = sum(f["substrate_cost"] * v for f, v in zip(reactions, flux)) atp = sum(f["yield_atp"] * v for f, v in zip(reactions, flux)) biomass = sum(f["yield_biomass"] * v for f, v in zip(reactions, flux)) penalty = 0.0 if oxygen > oxygen_budget: penalty += 6.0 * (oxygen - oxygen_budget) if substrate > substrate_budget: penalty += 6.0 * (substrate - substrate_budget) score = 2.2 * biomass + 0.6 * atp - penalty trace.append(score) if score > best_score: best_score = score best_flux = { "oxygen": oxygen, "substrate": substrate, "atp": atp, "biomass": biomass, "fluxes": {reactions[i]["name"]: float(flux[i]) for i in range(len(reactions))} } ranked_fluxes = sorted(best_flux["fluxes"].items(), key=lambda x: x[1], reverse=True) summary = { "oxygen_budget": oxygen_budget, "substrate_budget": substrate_budget, "best_objective_score": round(float(best_score), 4), "best_biomass": round(float(best_flux["biomass"]), 4), "best_atp": round(float(best_flux["atp"]), 4), "oxygen_used": round(float(best_flux["oxygen"]), 4), "substrate_used": round(float(best_flux["substrate"]), 4), "dominant_reactions": [ {"reaction": name, "flux": round(val, 4)} for name, val in ranked_fluxes[:6] ], } return AgentResult(name="MetabolicOptimizationAgent", summary=summary), trace We define the protein interaction prediction agent and the metabolic optimization agent, which together expand the analysis beyond regulation into interaction biology and pathway allocation. We train a logistic regression classifier on synthetic pairwise protein features to estimate interaction probabilities, evaluate predictive performance, and rank the strongest candidate protein pairs. We also run a randomized flux search under oxygen and substrate constraints to identify metabolically favorable reaction allocations, allowing us to study how the system balances biomass growth, ATP production, and resource limitations. Copy Code Copied Use a different Browser class CellSignalingSimulationAgent: def run(self, df_signal: pd.DataFrame) -> AgentResult: peak_receptor = float(df_signal["receptor_active"].max()) peak_kinase = float(df_signal["kinase_active"].max()) peak_tf = float(df_signal["tf_active"].max()) t_receptor = float(df_signal.loc[df_signal["receptor_active"].idxmax(), "time"]) t_kinase = float(df_signal.loc[df_signal["kinase_active"].idxmax(), "time"]) t_tf = float(df_signal.loc[df_signal["tf_active"].idxmax(), "time"]) final_state = df_signal.iloc[-1].to_dict() summary = { "peak_receptor_activity": round(peak_receptor, 4), "peak_kinase_activity": round(peak_kinase, 4), "peak_tf_activity": round(peak_tf, 4), "time_to_peak_receptor": round(t_receptor, 4), "time_to_peak_kinase": round(t_kinase, 4), "time_to_peak_tf": round(t_tf, 4), "final_state": {k: round(float(v), 4) for k, v in final_state.items()}, } return AgentResult(name="CellSignalingSimulationAgent", summary=summary) class PrincipalInvestigatorAgent: def __init__(self, client, model=OPENAI_MODEL): self.client = client self.model = model def synthesize(self, results: List[AgentResult]) -> str: payload = {r.name: r.summary for r in results} prompt = f""" You are a principal investigator in computational systems biology. Given the outputs of four specialized AI agents: 1. gene regulatory network analysis 2. protein interaction prediction 3. metabolic pathway optimization 4. cell signaling simulation Write a rigorous but readable report with these sections: - Executive Summary - Key Findings by Agent - Cross-System Biological Interpretation - Hypotheses Worth Testing in Wet Lab - Model Limitations - Next Computational Extensions Use concise scientific language. Do not fabricate datasets beyond what is shown. When useful, connect regulation, signaling, metabolism, and protein interactions into a single systems biology story. Agent outputs: {json.dumps(payload, indent=2)} """ try: resp = self.client.chat.completions.create( model=self.model, messages=[ {"role": "user", "content": prompt}, ], temperature=0.4, ) return resp.choices[0].message.content except Exception as e: return f"OpenAI synthesis failed: {e}" genes, W = generate_gene_regulatory_network(n_genes=14, edge_prob=0.20) X_expr = simulate_gene_expression(W, n_steps=80, noise=0.08) grn_agent = GeneRegulatoryNetworkAgent() grn_result = grn_agent.run(genes, W, X_expr) proteins, prot_features, prot_families, prot_localization = generate_protein_features(n_proteins=40, feature_dim=10) ppi_rows = generate_ppi_dataset(proteins, prot_features, prot_families, prot_localization) ppi_agent = ProteinInteractionPredictionAgent() ppi_result = ppi_agent.run(ppi_rows) metabolites, reactions = generate_metabolic_network() met_agent = MetabolicOptimizationAgent() met_result, met_trace = met_agent.run(reactions, oxygen_budget=3.5, substrate_budget=4.2) df_signal = simulate_cell_signaling(T=220, dt=0.05, ligand_level=1.2) sig_agent = CellSignalingSimulationAgent() sig_result = sig_agent.run(df_signal) all_results = [grn_result, ppi_result, met_result, sig_result] for r in all_results: pretty(r.name, json.dumps(r.summary, indent=2)) fig = plt.figure(figsize=(18, 14)) ax1 = plt.subplot(2, 2, 1) im = ax1.imshow(W, cmap="coolwarm", aspect="auto") ax1.set_title("Gene Regulatory Weight Matrix") ax1.set_xticks(range(len(genes))) ax1.set_yticks(range(len(genes))) ax1.set_xticklabels(genes, rotation=90) ax1.set_yticklabels(genes) plt.colorbar(im, ax=ax1, fraction=0.046, pad=0.04) ax2 = plt.subplot(2, 2, 2) for i in range(min(6, X_expr.shape[1])): ax2.plot(X_expr[:, i], label=genes[i]) ax2.set_title("Sample Gene Expression Dynamics") ax2.set_xlabel("Time step") ax2.set_ylabel("Expression") ax2.legend(loc="upper right", fontsize=8) ax3 = plt.subplot(2, 2, 3) ax3.plot(df_signal["time"], df_signal["receptor_active"], label="Receptor") ax3.plot(df_signal["time"], df_signal["kinase_active"], label="Kinase") ax3.plot(df_signal["time"], df_signal["tf_active"], label="Transcription Factor") ax3.plot(df_signal["time"], df_signal["phosphatase"], label="Phosphatase") ax3.set_title("Cell Signaling Simulation") ax3.set_xlabel("Time") ax3.set_ylabel("Activity") ax3.legend() ax4 = plt.subplot(2, 2, 4) ax4.plot(met_trace) ax4.set_title("Metabolic Search Objective Trace") ax4.set_xlabel("Iteration") ax4.set_ylabel("Objective score") plt.tight_layout() plt.show() G_grn = nx.DiGraph() for g in genes: G_grn.add_node(g) for i in range(len(genes)): for j in range(len(genes)): if abs(W[i, j]) > 0.4: G_grn.add_edge(genes[i], genes[j], weight=W[i, j]) plt.figure(figsize=(10, 8)) pos = nx.spring_layout(G_grn, seed=42) edge_colors = ["green" if G_grn[u][v]["weight"] > 0 else "red" for u, v in G_grn.edges()] nx.draw_networkx(G_grn, pos, with_labels=True, node_size=900, font_size=9, arrows=True, edge_color=edge_colors) plt.title("Gene Regulatory Network Graph (green=activation, red=repression)") plt.axis("off") plt.show() top_ppi = ppi_result.summary["top_predicted_interactions"][:12] G_ppi = nx.Graph() for row in top_ppi: a, b, p = row["protein_a"], row["protein_b"], row["pred_prob"] G_ppi.add_edge(a, b, weight=p) plt.figure(figsize=(10, 8)) pos = nx.spring_layout(G_ppi, seed=7) widths = [2 + 4 * G_ppi[u][v]["weight"] for u, v in G_ppi.edges()] nx.draw_networkx(G_ppi, pos, with_labels=True, node_size=1000, font_size=9, width=widths) plt.title("Top Predicted Protein Interaction Subnetwork") plt.axis("off") plt.show() grn_table = pd.DataFrame(grn_result.summary["most_dynamic_genes"]) ppi_table = pd.DataFrame(ppi_result.summary["top_predicted_interactions"]) met_table = pd.DataFrame(met_result.summary["dominant_reactions"]) sig_table = pd.DataFrame([sig_result.summary]) pretty("Most Dynamic Genes", grn_table.to_string(index=False)) pretty("Top Predicted PPIs", ppi_table.to_string(index=False)) pretty("Dominant Metabolic Reactions", met_table.to_string(index=False)) pi_agent = PrincipalInvestigatorAgent(client=client, model=OPENAI_MODEL) final_report = pi_agent.synthesize(all_results) pretty("OPENAI SYSTEMS BIOLOGY REPORT", final_report) artifact = { "grn": grn_result.summary, "ppi": ppi_result.summary, "metabolic": met_result.summary, "signaling": sig_result.summary, "llm_report": final_report, } with open("bio_agents_tutorial_results.json", "w") as f: json.dump(artifact, f, indent=2) print("\nSaved results to: bio_agents_tutorial_results.json") We define the cell signaling simulation agent and the principal investigator agent, and then execute the complete end-to-end workflow. We run all four biological modules, print structured outputs, generate plots and network visualizations, build tidy summary tables, and finally use the OpenAI model to write an expert-style report that integrates the findings across all subsystems. We bring everything together into a complete pipeline for biological systems modeling. It shows how multi-agent AI can support scientific interpretation, visualization, and hypothesis generation. In conclusion, we created a complete computational biology workflow that demonstrates how agent-based AI can be used to study multiple layers of biological organization in a structured and interpretable way. We moved from data generation to modeling, optimization, simulation, visualization, and final scientific synthesis, which helps us see how specialized agents can collaborate to produce richer biological insight than any single isolated analysis. At the end, we have a strong foundation for extending this notebook toward more realistic omics datasets, experimental priors, mechanistic constraints, and deeper biological hypothesis generation for advanced systems biology research. Check out the Full Codes with Notebook . Also, feel free to follow us on Twitter and don’t forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter . Wait! are you on telegram? now you can join us on telegram as well. Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us The post Build a Multi-Agent AI Workflow for Biological Network Modeling, Protein Interactions, Metabolism, and Cell Signaling Simulation appeared first on MarkTechPost .