Welcome to PEARC20!
PEARC20’s theme is “Catch the Wave.” This year’s theme embodies the spirit of the community’s drive to stay on pace and in front of all the new waves in technology, analytics, and a globally connected and diverse workforce. We look forward to this year’s PEARC20 virtual meeting, where we can share scientific discovery and craft the future infrastructure.

The conference will be held in Pacific Time (PT) and the times listed below are in Pacific Time.

The connection information for all PEARC20 workshops, tutorials, plenaries, track presentations, BOFs, Posters, Visualization Showcase, and other affiliated events is in the PEARC20 virtual conference platform, Brella. If you have issues joining Brella, please email pearcinfo@googlegroups.com.
Tuesday, July 28 • 1:35pm - 2:35pm
Log Discovery for Troubleshooting Open Distributed Systems with TLQ


Troubleshooting a distributed system can be incredibly difficult. It is rarely feasible to expect a user to know the fine-grained interactions between their system and the environment configuration of each machine used in the system. Because of this, work can grind to a halt when a seemingly trivial detail changes. To address this, there is a plethora of state-of-the-art log analysis tools, debuggers, and visualization suites. However, a user may be executing in an open distributed system where the placement of their components is not known before runtime. This makes the process of tracking debug logs almost as difficult as troubleshooting the failures these logs have recorded, because the location of those logs is usually not transparent to the user (and, by extension, to the troubleshooting tools they are using). We present TLQ, a framework designed from first principles for log discovery to enable troubleshooting of open distributed systems. TLQ consists of a querying client and a set of servers which track relevant debug logs spread across an open distributed system. Through a series of examples, we demonstrate how TLQ enables users to discover the locations of their system's debug logs and in turn use well-defined troubleshooting tools upon those logs in a distributed fashion. Both of these tasks were previously impractical to ask of an open distributed system without significant a priori knowledge. We also concretely verify TLQ's effectiveness by way of a production system: a biodiversity scientific workflow. We note the potential storage and performance overheads of TLQ compared to a centralized, closed system approach.
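
To make the client/server split described in the abstract concrete, here is a minimal sketch of log discovery in Python. It is not TLQ's actual interface: the names `LogServer`, `register`, `lookup`, and `discover`, along with the host names and log paths, are hypothetical and chosen only for illustration. The sketch assumes each machine runs a small server that records where local debug logs live, and that a client asks every known server where a given component's log is.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LogServer:
    """Hypothetical per-machine server that tracks local debug log paths."""
    host: str
    logs: Dict[str, str] = field(default_factory=dict)  # component -> log path

    def register(self, component: str, path: str) -> None:
        # Record where a component placed at runtime writes its debug log.
        self.logs[component] = path

    def lookup(self, component: str) -> List[Tuple[str, str]]:
        # Return (host, path) if this machine holds the component's log.
        path = self.logs.get(component)
        return [(self.host, path)] if path is not None else []


def discover(servers: List[LogServer], component: str) -> List[Tuple[str, str]]:
    """Hypothetical client query: ask every known server for a component's log."""
    found: List[Tuple[str, str]] = []
    for server in servers:
        found.extend(server.lookup(component))
    return found


# Components land on machines that are not known before runtime.
servers = [LogServer("node-a"), LogServer("node-b")]
servers[0].register("worker-1", "/tmp/worker-1.debug")
servers[1].register("worker-2", "/scratch/worker-2.debug")

locations = discover(servers, "worker-2")
print(locations)
```

In this sketch the user never needs to know in advance which machine ran `worker-2`; the query returns the host and path, which could then be handed to an existing troubleshooting tool.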


Tuesday July 28, 2020 1:35pm - 2:35pm PDT
Brella