Computing the Unmeasured: An Algebraic Approach to Internet Mapping
Shavitt, Sun, Wool, Yener. INFOCOM '01
- Offshoot of IDMaps Project (Jamin et al, Michigan)
- Importance of distance estimation (Application: selecting from a set of
equivalent, distributed services; CDNs; others?)
- "Tracers" deployed at key locations.
- Periodic measurements of distance metrics from these nodes to other
"Tracers", other endpoints.
- Distance information relayed to clients, typically other servers (SONAR,
HOPS) which then build an estimated distance map.
- Clients of these servers then triangulate to estimate relevant distances
for themselves.
- traceroute: radiating ICMP probes to learn IP hops and RTT.
- ping: end-to-end probes to measure RTT.
- Previous Work on IDMaps (worth reading)
- INFOCOM '99: Architecture, motivation.
- INFOCOM '00: Heuristics for placement of "tracers". Relationship
to algorithms for solving k-center problems.
- More heuristics for placement in other INFOCOM '00, INFOCOM '01 papers.
- Plenty of related work (much of it "algebraic") on Internet measurement.
- For example: inference from end-to-end measurements.
- Multicast: MINC project at U. Mass.
- Unicast: One aspect of Mass servers project here at B.U.
- Goals of this paper.
- Extract maximal set of measurement information from correlated
pairwise measurements.
- In so doing, attempt to handle noisy data (much more attention
could be paid to this point).
- Handle unidirectional vs. bidirectional measurements.
- Definitions and Basic Approach:
- On physical topology, overlay set of measurement paths ending at tracers.
Paths intersect at crossing points.
- Induced topology is a graph where crossing points, tracers are vertices,
edges are called segments.
- When measurements are correct, and measurements sum, they form a
system of linear equations, where the values on the segments are the
unknowns.
- (Equation matrix induced by the constraints is binary.)
- Soluble via Gaussian elimination.
- What is interesting about this?
- What if system is under-specified?
- What if system is over-specified (and insoluble)?
- What if measurements are inexact (very likely)?
- What if metric of interest does not sum?
- How do we compute the minimal set of measurements needed to obtain
a fully specified system of equations over all segments?
- Noisy Measurements (IV.D)
- Case 1: 2-sided error w/unknown distribution on the noise.
- Technique: Minimize least-squares error on rows of the
constraint matrix (Formula 8). How?
- Case 2: 1-sided error w/unknown distribution on the noise
(i.e. measured delay is always at least as large as the actual
propagation delay).
- Technique: Compute maximizing lower envelope.
- One formulation: Frame as a linear program where the goal is to
minimize the maximum discrepancy on a per-row basis.
- Not necessarily the way I would formulate it...
- What is system is underspecified? (IV.E)
- Any solved variables are in a transformed system of equations.
- Need to transform back to the ``segment'' domain.
- Can check whether it is possible to solve for a variable efficiently.
- My guess: probably not often useful (not analyzed in the paper).
- What if metric of interest does not sum?
- How do we compute the minimal set of measurements needed to obtain
a fully specified system of equations over all segments?
- Implementation details
- Node disambiguation. (A problem we have run into frequently
in our research.) Nodes may have multiple NICs.
- Their solution: DNS queries resolve 94% of IP addresses and
DNS names are assumed unique. Really?
- All nodes in a "cluster of routers" in the same physical space
should be considered equivalent for mapping purposes.
- Ad-hoc solutions (naming conventions, public maps). Other ideas?
- Performance
- Qualitative: "Interesting" subpaths were measured.
- Quantitative (Fig 4): On experimental data, discovered nodes grows
like 2N, where N is the number of tracers. Why? Note that number
of measurements grows quadratically.
- No a priori control over which subpath variables will be soluble.
Or are there? Interesting direction...
- Synthetic Results
- Generated topologies. Waxman and power-law connection models.
- Shortest-path routing, random Tracer placement.
- "Random" delay on edges (not clear how chosen).
- "Measured" delay deviated from actual delay by multiplicative noise
interval.
- Vast differences between Waxman/PL results. Why?
- Claim that PL fits Internet data better.
- Suspicious Figure 5. Why do points lie on the line? Is there a
better way to present this information?
- What experiments would you design to demonstrate the merits of this
system?