Tracing (BSDCam 2017 Working Group)
Thursday, August 3, 2017
Agenda
- Black box
- DTrace
- KTR, etc.
- ptrace, used by truss
- HW tracing
Black Box
- Traces TCP events
- Dumps asynchronously to user space
- High performance
- Allows sampling on a per-socket basis (i.e. all events on 1 out of n connections)
- Can dump all events or just the last n events
- Questions:
- rwatson@ wonders whether we could combine this with siftr
- rwatson@ wonders whether dtrace could do the same thing with a custom provider
- At minimum: rwatson@ would like to:
- Provide just a single high-performance TCP tracing system (in addition to dtrace)
- Let other mechanisms (such as dtrace) trigger on the same trace points
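The per-socket sampling and "last n events" behaviour noted above can be pictured with a small model. This is only a sketch under assumptions: the class name `BlackBoxSketch`, its methods, and the round-robin choice of which connection to trace are all hypothetical and are not taken from the black box implementation.

```python
from collections import deque


class BlackBoxSketch:
    """Toy model: per-connection sampling plus a bounded 'last n events' log."""

    def __init__(self, sample_one_in, keep_last=None):
        self.sample_one_in = sample_one_in  # trace 1 out of this many connections
        self.keep_last = keep_last          # None keeps every event
        self.seen = 0                       # connections observed so far
        self.logs = {}                      # connection id -> deque of events

    def open_connection(self, conn_id):
        # The sampling decision is made once per connection (an assumption),
        # so a traced connection records all of its events and an untraced
        # connection records none.
        if self.seen % self.sample_one_in == 0:
            self.logs[conn_id] = deque(maxlen=self.keep_last)
        self.seen += 1

    def record(self, conn_id, event):
        log = self.logs.get(conn_id)
        if log is not None:
            # A full bounded deque drops its oldest entry on append.
            log.append(event)

    def dump(self, conn_id):
        # Returns the retained events, oldest first (empty if not traced).
        return list(self.logs.get(conn_id, ()))
```

With `sample_one_in=2` and `keep_last=3`, every second connection is traced and each traced connection retains only its three most recent events.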
DTrace
- Work done recently/in progress:
- Test suite updated
- libxo output
- Reliability/performance enhancements
- Audit provider (dtrace trace points for audit events)
- Portability
- VMs (bhyve)
- Language extensions
- if
- loops
- modules
- hwpmc interaction
- loom (adds userspace dtrace probes at the LLVM intermediate representation level)
- Open questions
- Format: CTF -> DWARF?
- LLVM + CTF?
- rwatson@: motivation not sufficient to make a change
- Security model
- Would be nice to allow non-root users to run selected probes (e.g. trace their own process)
- rwatson@: two problems
- Need to audit the byte-code interpreter to ensure it is safe
- Need to figure out how to apply appropriate safeguards to ensure the non-root user is only allowed to see events, memory, etc. associated with its own processes
- Potential enhanced providers:
- Enhanced TCP provider
- mbuf provider
- Associate variables, etc. with an mbuf so you can track an mbuf through later probes
- PCAP provider
- Unify KTR and dtrace macros?
- Licensing issue: CDDL?
- Are we happy with the implementation we have? Would we be upset if it goes away? Would we be happier with a reimplementation with a BSD license? And, would we be enough happier to actually invest the years of effort to reimplement it?
- Do we need to keep other tracing tools in case access to dtrace is revoked?
- eBPF as an alternative? But, still licensing questions.
- Effort to extract DTrace bits from FreeBSD into a package consumable by multiple platforms
- Effort to create a specification
- Rewrite compiler in LLVM?
- Brooks suggests we instead teach clang to run ctfconvert
- bz@ suggests that someone update the man pages
- gnn@ and markj@ are already working on it; may involve the doc team
KTR, etc.
- Delayed string/memory expansion
- Pointers evaluated by user space at expansion time; may be invalid by then
- But, this simplicity makes ktr run much more efficiently
- KTR/DTrace unification
- Probe point unification may be hard due to different semantics
- Can we do that in an automatic way?
- Combine where it makes sense (Brooks suggests that ktrace might even be able to use some of the same probes)
- gnn@ suggests that we should have a DTrace/KTR/etc. version of printf() that logs instead of printing
- Works in panic()
- Low overhead
- Leaves a trace in a vmcore
- Spare bits
- Make sure all bits are properly documented, and no spare bits overlap assigned bits
- Registry? Central documentation?
- CPU stalls
- Multiple threads stall waiting on a shared atomic (for the trace buffer?)
- Who owns KTR?
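The trade-off behind delayed string/memory expansion noted above can be illustrated with a toy model. This is a sketch under assumptions, not KTR's code: `Box` stands in for memory that a stored pointer refers to, and `DeferredTraceSketch` is a hypothetical name.

```python
class Box:
    """Stands in for memory that a stored pointer refers to."""

    def __init__(self, value):
        self.value = value

    def __str__(self):
        return str(self.value)


class DeferredTraceSketch:
    """Toy model: log a format string and raw arguments, format later."""

    def __init__(self):
        self.records = []

    def trace(self, fmt, *args):
        # Hot path: store references only; no string formatting happens here,
        # which is what keeps the logging step cheap.
        self.records.append((fmt, args))

    def expand(self):
        # Cold path: format now, using whatever the arguments hold now.
        # If the referenced data changed in the meantime, the output shows
        # the later contents, not the contents at trace time.
        return [fmt % args for fmt, args in self.records]
```

If a `Box` holding `"em0"` is traced and then changed to `"igb1"` before `expand()` runs, the expanded record shows `"igb1"`. In the real setting the stale reference may point at freed or reused memory rather than a merely updated value.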
ptrace/truss
- des@ discusses problems with ptrace/truss
- Mechanism is to:
- reparent process being traced
- run waitpid() on process
- Breaks job control
- CTRL-Z doesn't work
- If process forks, the child will be reparented, breaking the child-parent relationship
- Possible solution:
- Store "tracing parent" (as distinct from the "parent")
- Challenges:
- Need to check "tracing parent" in several places: leads to code duplication.
- wait6(), exit1() need to be modified to send information to multiple processes.
- Need to signal both parent and "tracing parent". But, what order?
- Challenges:
- rwatson@ says jhb@ may already have a solution; should collaborate with him
- Store "tracing parent" (as distinct from the "parent")
- pdfork() returns a file descriptor. If you ptrace a process, you can end up with multiple FDs referencing the same process.
- wait6(), exit1() need to be modified to send data to signal consumers
- Is it OK for the first process to consume the event and the rest to get nothing? Or, do we need to cache the information so multiple consumers can read it?
- rwatson@ says rusage accounts for most of the memory that would need to be cached
- Information should be cached with the process descriptor structure
- rwatson@ notes that kqueue is level-triggered, whereas waitpid is edge-triggered (and has a side effect: the status can only be read once, and reading it makes the zombie go away)
- Can you cache the information in the process descriptor, only return information once per FD, and make the process descriptor go away once all the FDs are closed?
- rwatson@ proposes pdwait() to handle this
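The caching question raised above (cache the exit information with the process descriptor, return it once per FD, and free it once all FDs are closed) can be modelled in a few lines. This is a sketch of the idea only: `ProcDescSketch` and its methods are hypothetical, and the `pdwait()` method merely borrows the name of the proposal; it does not describe an existing system call.

```python
class ProcDescSketch:
    """Toy model: exit info cached per process, read once per descriptor."""

    def __init__(self):
        self.exit_info = None   # (status, rusage) once the process has exited
        self.open_fds = set()   # descriptors currently referencing the process
        self.unread = set()     # descriptors that have not consumed exit_info
        self.freed = False      # True once the cached state has been released

    def open_fd(self, fd):
        self.open_fds.add(fd)
        self.unread.add(fd)

    def process_exited(self, status, rusage):
        # Cache the information instead of handing it to the first waiter.
        self.exit_info = (status, rusage)

    def pdwait(self, fd):
        # Hypothetical call: returns the exit info at most once per descriptor.
        if fd not in self.open_fds:
            raise ValueError("descriptor is not open")
        if self.exit_info is None or fd not in self.unread:
            return None
        self.unread.discard(fd)
        return self.exit_info

    def close_fd(self, fd):
        self.open_fds.discard(fd)
        self.unread.discard(fd)
        if not self.open_fds:
            # Last reference gone: the cached information can be released.
            self.exit_info = None
            self.freed = True
```

In this model every descriptor holder sees the exit event exactly once, unlike waitpid, where the first reader consumes the status and reaps the zombie.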
HW Tracing
- HW tracing features on Intel/ARM
- Need to examine the sequence of branches executed by user-space code (code coverage, optimization, etc.).
- New tools do this with uninstrumented code (PC Sampling or Last Branch Record).
- A different tool does this using LBR or PT (Intel) or ETM (ARM).
- New tracing isn't sampled (so no missing data). Some modes provide sequence of branches; some actually also provide timing information.
- However, even though the trace isn't sampled, it produces so much data that capturing long traces is difficult.
- ETM timing information: cycle count or global system timer (constant across all cores)
- Linux is working on this.
- ARM decoder will be BSD licensed; ARM driver will be Apache licensed
- Intel code for Linux is probably GPL-licensed
- ETM is global, so more difficult; PT can be per-process
- ETM can capture certain events as part of the branch records
- PT is discoverable via CPUID. Using ETM requires the kernel to know more about the hardware, which requires vendors to supply that information
- Need to examine the sequence of branches executed by user-space code (code coverage, optimization, etc.).
- DynamoRIO ARMv8 support available
- Also supports Intel
- Lets you instrument uninstrumented binaries