this post was submitted on 02 Apr 2025

cross-posted from: https://lemmy.sdf.org/post/31995242


Unveiling Trae: ByteDance's AI IDE and Its Extensive Data Collection System

Trae - the coding assistant of China's ByteDance - has rapidly emerged as a formidable competitor to established AI coding assistants like Cursor and GitHub Copilot. Its main selling point? It's completely free - offering Claude 3.7 Sonnet and GPT-4o without any subscription fees. Unit 221B's technical analysis, using network traffic interception, binary analysis, and runtime monitoring, has identified a sophisticated telemetry framework that continuously transmits data to multiple ByteDance servers. From a cybersecurity perspective, this represents a complex data collection operation with significant security and privacy implications.

[...]

Key Findings:

  • Persistent connections to at least 5 unique ByteDance domains, creating multiple data transmission vectors
  • Continuous telemetry transmission even during idle periods, indicating an always-on monitoring system
  • Regular update checks and configuration pulls from ByteDance servers, allowing for dynamic control
  • Permanent device identification via machineId parameter, which appears to be derived from hardware identifiers, enabling long-term tracking capabilities
  • Local WebSocket channels observed collecting full file content, with portions potentially transmitted to remote servers
  • Complex local microservice architecture with redundant pathways for code data, suggesting a deliberate system design
  • JWT tokens and authentication data observed in multiple communication channels, presenting potential credential exposure concerns
  • Use of binary MessagePack format observed in data transfers, adding complexity to security analysis
  • Extensive behavioral tracking mechanisms capable of building detailed user activity profiles
  • Sophisticated data segregation across multiple endpoints, consistent with enterprise-grade telemetry systems
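To make the machineId finding concrete, here is a minimal sketch of how a stable device identifier can be derived from hardware-related values. This is not Trae's actual implementation, which the excerpt does not describe. The function names `derive_machine_id` and `collect_identifiers` and the choice of inputs (MAC address, CPU architecture, OS name) are assumptions made purely for illustration.

```python
import hashlib
import platform
import uuid


def derive_machine_id(identifiers):
    """Hash a list of identifier strings into a fixed-length hex digest.

    Hypothetical helper: the same inputs always give the same digest,
    which is what makes such an ID usable for long-term tracking.
    """
    joined = "\n".join(identifiers).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


def collect_identifiers():
    """Gather a few values that rarely change on a given machine.

    Assumption: real fingerprinting code would likely use other sources.
    uuid.getnode() returns the MAC address as an integer, or a random
    48-bit number if no MAC address can be determined.
    """
    return [
        format(uuid.getnode(), "012x"),  # MAC address as 12 hex digits
        platform.machine(),              # CPU architecture, e.g. "x86_64"
        platform.system(),               # OS name, e.g. "Linux"
    ]


if __name__ == "__main__":
    print(derive_machine_id(collect_identifiers()))
```

Because the digest depends only on the inputs, reinstalling the application or clearing its settings would not change an ID built this way. Only changing the underlying hardware values would.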

[...]

[–] [email protected] 7 points 1 day ago* (last edited 1 day ago) (2 children)

Some of these key findings seem a bit overblown. The number of domains persistently connected to shouldn't really matter - one is enough. Update checks are standard for software. Unique IDs/device fingerprinting are so common that browsers build in ways to try to prevent it at scale. JWTs are standard authentication tools - who's the security concern for? ByteDance? Or are you saying the JWTs are from the local machine? And MessagePack isn't exactly a secret format either.
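To illustrate the MessagePack point, here is a rough sketch of a decoder for a small subset of the format, written against the public specification. It only handles positive fixint, fixmap, fixarray, fixstr, nil, booleans and uint8, and the function name `decode` is made up for this example. A real analysis would use a full library instead.

```python
def decode(buf, pos=0):
    """Decode one MessagePack value from `buf` starting at `pos`.

    Returns (value, next_position). Supports only a small subset of
    the type tags; anything else raises ValueError.
    """
    tag = buf[pos]
    pos += 1
    if tag <= 0x7F:                      # positive fixint
        return tag, pos
    if 0x80 <= tag <= 0x8F:              # fixmap, low 4 bits = entry count
        result = {}
        for _ in range(tag & 0x0F):
            key, pos = decode(buf, pos)
            value, pos = decode(buf, pos)
            result[key] = value
        return result, pos
    if 0x90 <= tag <= 0x9F:              # fixarray, low 4 bits = length
        items = []
        for _ in range(tag & 0x0F):
            item, pos = decode(buf, pos)
            items.append(item)
        return items, pos
    if 0xA0 <= tag <= 0xBF:              # fixstr, low 5 bits = byte length
        length = tag & 0x1F
        return buf[pos:pos + length].decode("utf-8"), pos + length
    if tag == 0xC0:                      # nil
        return None, pos
    if tag == 0xC2:                      # false
        return False, pos
    if tag == 0xC3:                      # true
        return True, pos
    if tag == 0xCC:                      # uint8
        return buf[pos], pos + 1
    raise ValueError(f"unsupported type tag 0x{tag:02x}")


if __name__ == "__main__":
    # 18-byte encoding of {"compact": true, "schema": 0}
    raw = bytes.fromhex("82a7636f6d70616374c3a6736368656d6100")
    print(decode(raw)[0])
```

A binary format like this adds a decoding step for whoever inspects the traffic, but it does not hide anything.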

The TL;DR of this seems to be that ByteDance's AI IDE collects a crazy amount of data and offers free AI services in exchange. I'm not really sure why you'd want those services, especially at the cost of all your code potentially being stolen or other data being collected, but it should be obvious that nothing in this world is truly free.

[–] [email protected] 1 points 1 day ago

If your code is open source anyway, there might be a reason to use their free services.

[–] [email protected] 1 points 1 day ago

JWTs are standard authentication tools - who’s the security concern for? ByteDance? Or are you saying the JWTs are from the local machine?

Yes, I read that as local project JWTs being transmitted to their servers. Since it's listed as a concern and not labeled as being used for authentication, IMO it's clearly implied that they observed JWT tokens and auth data unrelated to any telemetry auth (if they even have any).

JWT tokens and authentication data observed in multiple communication channels, presenting potential credential exposure concerns
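For anyone wanting to check this on their own captures, here is a minimal sketch of scanning text for JWT-shaped strings and decoding their header and payload. It does not verify signatures, and it says nothing about what Trae actually sends. The names `find_jwts` and `b64url_decode` are made up for this example, and the "eyJ" prefix is a heuristic that matches the typical case where the JSON starts with `{"` followed by a letter.

```python
import base64
import json
import re

# Three dot-separated base64url segments. Header and payload are JSON
# objects, so they typically begin with "eyJ" (base64url of '{"' + letter).
JWT_PATTERN = re.compile(r"eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]*")


def b64url_decode(segment):
    """Decode a base64url segment, restoring the padding JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def find_jwts(text):
    """Return decoded header/payload pairs for JWT-like strings in `text`.

    Candidates whose segments are not valid base64url-encoded JSON are
    skipped. Signatures are ignored, not verified.
    """
    found = []
    for match in JWT_PATTERN.finditer(text):
        header_b64, payload_b64, _signature = match.group(0).split(".")
        try:
            header = json.loads(b64url_decode(header_b64))
            payload = json.loads(b64url_decode(payload_b64))
        except ValueError:  # covers bad base64, bad UTF-8 and bad JSON
            continue
        found.append({"header": header, "payload": payload})
    return found
```

The payload of a JWT is only encoded, not encrypted, so anyone who sees the token can read its claims. Whether it can also be replayed depends on its expiry and audience.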