Langtrain
Langtrain Docs
DocsAPI ReferenceSDK Reference
AppChat
GitHubDiscord

Agentic Computer Vision

Empower your agents to 'see' and interact with the visual world using multi-modal capabilities.

Vision-Language Models
Multi-Modal

What is Langvision?

Langvision is the multi-modal subsystem of the Langtrain ecosystem. It abstracts the complexity of integrating Vision-Language Models (VLMs) like LLaVA, Qwen-VL, and Pixtral into your agentic workflows.

Core Capabilities

  • •UI Understanding: Langvision models are specifically fine-tuned on web and desktop interfaces. They can parse bounding boxes, identify clickable elements, and read text natively from screenshots.
  • •Visual QA: Pass images along with text prompts to ask complex questions about graphs, diagrams, or real-world photographs.
  • •Continuous Streaming: For robotics or screen-recording applications, Langvision can process video frame streams in near real-time using optimized context caching.

Integration in Studio

Using Langvision in Langtrain Studio is as simple as dropping a 'Vision Node' onto your canvas. When an agent requires visual context to complete a task, it can trigger the Vision Node to request a screenshot, parse the current state, and make an informed decision.
Previous
Isolated Compute
Next
Screen Control APIs