Name: Swift distributed tracing method and tools
Start: 2014-05-16T09:00:00-0400
End: 2014-05-16T09:40:00-0400

Swift distributed tracing method and tools

This session will include the following subject(s):

Swift distributed tracing method and tools:

Swift is a large-scale distributed object store that spans thousands of nodes across multiple zones and regions. End-to-end performance is critical to the success of Swift. Methods and tools that aid in understanding its behavior and reasoning about performance issues are invaluable.

Motivation:
1) For a particular client request X, what is the actual route it takes as it is served by different services? Does the actual route differ from the expected route, even when we know the access patterns?
2) What is the performance behavior of the server components and third-party services? Which part is slower than expected?
3) How can we quickly diagnose the problem when something breaks at some point?

Current Implementation:
1) StatsD:
a) Designed for the cluster level, not for end-to-end performance.
b) Cannot provide metrics data for a specific set of requests.
c) No relationship between the different sets of metrics for a specific transaction or request.
2) Logging:
a) Not designed for real-time analysis.
b) Requires more effort to collect and analyze.
c) No representation for an individual span.
d) Message size limitation.
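To make the "individual span" in item 2c concrete, here is a minimal sketch of what a per-request span record could look like and how the actual route of one request could be rebuilt from such records. The `Span` class, its field names, the `route` helper, and the sample data are all hypothetical; nothing here is an existing Swift or StatsD structure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One timed hop of a request (all field names are hypothetical)."""
    trace_id: str             # shared by every span of one client request
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None for the entry point of the request
    service: str              # which service handled this hop
    start: float              # seconds
    end: float                # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def route(spans: List[Span], trace_id: str) -> List[str]:
    """Return the services visited by one request, in depth-first order."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)

    path: List[str] = []

    def walk(parent_id: Optional[str]) -> None:
        # Visit sibling spans in the order they started.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start):
            path.append(span.service)
            walk(span.span_id)

    walk(None)
    return path


# Made-up sample data for two requests.
spans = [
    Span("tx1", "a", None, "proxy-server", 0.00, 0.30),
    Span("tx1", "b", "a", "container-server", 0.01, 0.05),
    Span("tx1", "c", "a", "object-server", 0.06, 0.28),
    Span("tx2", "d", None, "proxy-server", 0.10, 0.12),
]

print(route(spans, "tx1"))

# The slowest downstream hop of request tx1 (root span excluded).
slowest = max(
    (s for s in spans if s.trace_id == "tx1" and s.parent_id is not None),
    key=lambda s: s.duration,
)
print(slowest.service, slowest.duration)
```

Because every record carries the request's trace ID and its parent span, the metrics of one transaction stay related to each other, which is what items 1b and 1c say is missing today.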

Can we provide a real-time, end-to-end performance tracing/tracking tool in the Swift infrastructure that helps developers and users with their analysis during development and operation?

Ideas:
Add WSGI middleware and hooks to Swift components to collect trace data.
Make minor fixes to the current Swift implementation so that the traced path includes every hop.
Provide analysis tools for reporting and visualization.
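The first idea could be sketched roughly as follows. This is only an illustration of the general WSGI middleware pattern, not the proposed implementation: the `TraceMiddleware` class, the `X-Trace-Id` header, and the list used as a collector are all assumptions made for the example.

```python
import time
import uuid
from wsgiref.util import setup_testing_defaults


class TraceMiddleware:
    """Hypothetical WSGI middleware that records one span per request."""

    def __init__(self, app, service, sink):
        self.app = app
        self.service = service
        self.sink = sink  # any object with append(); a real collector is assumed

    def __call__(self, environ, start_response):
        # Reuse the caller's trace ID if one was sent, otherwise start a trace.
        # "X-Trace-Id" is an assumed header name chosen for this sketch.
        trace_id = environ.get("HTTP_X_TRACE_ID") or uuid.uuid4().hex
        environ["HTTP_X_TRACE_ID"] = trace_id
        seen = {}

        def traced_start_response(status, headers, exc_info=None):
            seen["status"] = status  # remember the status for the span record
            return start_response(status, headers, exc_info)

        started = time.perf_counter()
        try:
            # This times only the call into the wrapped app, not the
            # streaming of the response body.
            return self.app(environ, traced_start_response)
        finally:
            self.sink.append({
                "trace_id": trace_id,
                "service": self.service,
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                "status": seen.get("status"),
                "duration": time.perf_counter() - started,
            })


def demo_app(environ, start_response):
    """Stand-in for a real server application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


collected = []
app = TraceMiddleware(demo_app, "proxy-server", collected)

environ = {}
setup_testing_defaults(environ)  # fills in a minimal valid WSGI environ
environ["HTTP_X_TRACE_ID"] = "tx1"
body = app(environ, lambda status, headers, exc_info=None: None)
print(collected[0])
```

For the path to include every hop, each service would also have to forward the trace ID on its outgoing requests; that part is not shown here.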

(Session proposed by Edward)

Tools/Methodologies for observing/diagnosing swift:

The first step in observing any system's behavior is to use a test harness that lets you control and vary the load in various ways, something many tools can do for you. However, most focus on reporting overall metrics and pay little attention to what happens over the duration of the test. That led me to write my own tools (many of which are open source or could be), which provide the detail needed to diagnose problems when they occur.
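As an illustration of why per-interval reporting matters, the sketch below groups latency samples into fixed time buckets instead of reporting a single overall number. It is not one of the tools mentioned in this abstract; the function name `summarize_by_interval` and the sample data are made up for the example.

```python
from collections import defaultdict


def summarize_by_interval(samples, interval=1.0):
    """Group (timestamp, latency) samples into fixed-width time buckets.

    Returns a list of (bucket_start, count, mean_latency, max_latency)
    tuples ordered by time.
    """
    buckets = defaultdict(list)
    for timestamp, latency in samples:
        buckets[int(timestamp // interval)].append(latency)

    rows = []
    for index in sorted(buckets):
        values = buckets[index]
        rows.append((index * interval, len(values),
                     sum(values) / len(values), max(values)))
    return rows


# Made-up samples: (seconds since test start, operation latency in seconds).
samples = [(0.1, 0.02), (0.6, 0.02), (1.2, 0.02), (1.7, 2.50), (2.3, 0.02)]

# A single overall mean hides when the slow operation happened.
overall_mean = sum(latency for _, latency in samples) / len(samples)
print("overall mean:", overall_mean)

# Per-interval rows show which second of the test contained the stall.
rows = summarize_by_interval(samples, interval=1.0)
for row in rows:
    print(row)
```

Lining up such rows against system metrics sampled at the same interval is one way to correlate a latency spike with what the disks or network were doing at that moment.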

The intent of this talk is to discuss the tools and testing methodologies I use to stress Swift and help track down the root causes when problems occur. One problem I was able to uncover was that 1KB PUTs were actually several times slower than 2KB PUTs, and once V2.0 of swiftclient was released I was able to quickly identify a problem with it that caused ALL PUTs to run twice as slow! It's since been fixed.

I've also found problems where object servers were taking over a second, and by examining fine-grained disk metrics I was able to show that the cause was intermittent high disk latency that had nothing to do with Swift but rather with a controller setting!

Currently I'm in the process of investigating 2 different problems, in both the proxy and object servers, where multi-second hangs occur under heavy load. This is something I can easily demonstrate (if not resolved yet) during this session, and perhaps as a group we can diagnose what is going on in real time.

(Session proposed by Mark Seger)

Friday May 16, 2014 9:00am - 9:40am EDT
B302

Swift

Juno Design Summit
