Challenges managing large-scale wireless networks

Lakshminarayanan Subramanian
Courant Institute of Mathematical Sciences
New York University
Joint work with many others
Management Complexity Ladder

Indoor Wireless Access Point Network

Multi-hop indoor wireless networks

Outdoor Mesh Networks

Rural Wireless Networks (long-distance + mesh)

Why is management hard?
Potential causes for performance degradation
• External problems
  - Interference, channel fluctuations
• Network performance issues
  - Forwarding problems, unexpected packet drops
• Software issues + configuration
  - Incorrect ETX, channel assignment, routing problems
• Physical issues
  - Radio separation issues
  - Unreliable power (huge problem in rural wireless)
• Mundane problems
  - Loose pigtail, card misbehavior, card stops working

Why is it hard to fix?
• Potential causes are huge and interdependent
• No back channels (in multi-hop cases)
• Measurements vary by the second
• Environmental fluctuations
• Power fluctuations
• Software behavior on wireless boards is not very predictable
• Climbing street poles and towers is actually not fun!

Some experiences

ROMA: Multi-radio indoor wireless network (Aditya, Jinyang)

CitySense: Outdoor wireless mesh network (Matt T, Matt W)

WiLDNet: Long-distance WiFi networks (Rabin, Sergiu, Sonesh, Eric, Manuel)

WiRE architecture (Matt T, Aditya)

Multi-radio mesh promises greater throughput
[Figure: mesh paths leading to gateways. Nodes sharing a channel cannot transmit concurrently. Goals shown: eliminate intra-path interference, reduce inter-path interference.]

Physical constraints
• Compact nodes → few radios per node
• Link losses, link variability, external load

Link variabilities
• Two radios report different channel conditions
• ETX measurements are skewed
• Channel 1 works very poorly, channel 11 works well!

ROMA: basic idea
[Figure: channel assignment along paths from a gateway. Node labels: <C1> (single-radio gateway), <C1,C2>, <C2,C3>, <C3,C4>. Link labels: C1, C2, C3.]
Each radio in a multi-radio gateway acts as an independent gateway

Routing metric must consider worst link
[Figure: example paths with links labeled 1 and 2.]
Single-radio route metric: 1 / Σ_i ETT_i
Multi-radio route metric: 1 / max_i ETT_i
Path throughput is limited by worst link

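
The two route metrics can be compared numerically. The sketch below assumes one reading of this slide: the single-radio metric is 1 / (sum of per-link ETTs), because every hop shares one channel, and the multi-radio metric is 1 / (largest per-link ETT), because hops on distinct channels can transmit concurrently and the worst link becomes the bottleneck. The function names and the ETT values are made up for illustration and do not come from ROMA itself.

```python
def single_radio_metric(etts):
    """All hops share one channel, so per-link transmission times add up."""
    return 1.0 / sum(etts)


def multi_radio_metric(etts):
    """Hops use distinct channels, so the slowest (worst) link is the bottleneck."""
    return 1.0 / max(etts)


# Hypothetical per-link ETT values (seconds per packet) for a 3-hop path.
path_etts = [0.002, 0.004, 0.002]

print(f"single-radio: {single_radio_metric(path_etts):.1f} packets/s")
print(f"multi-radio:  {multi_radio_metric(path_etts):.1f} packets/s")
```

Under these assumptions the multi-radio estimate can never be lower than the single-radio one for the same links, since the largest ETT is at most the sum of all ETTs.
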
Link metric
• ETT over-estimates link performance
• Link metric should incorporate:
  - Link variability: highly variable links result in unpredictable throughput
  - External load

ETT = 1 / f(p_a)
Conservative metric: CETT = 1 / ( f(p_a − p_v) * (1 − L) )

p_a : average delivery ratio
p_v : deviation of delivery ratio
L : fraction of time channel is busy with external traffic
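
A small numeric sketch may help show why a conservative metric can rank links differently from plain ETT. It assumes ETT = 1 / f(p_a) and CETT = 1 / (f(p_a − p_v) * (1 − L)), as read from this slide. The slide does not define f, so the linear f below (delivery ratio times a nominal link rate) is purely a placeholder, and the two example links are invented.

```python
def f(delivery_ratio, link_rate_mbps=11.0):
    """Placeholder for the slide's f(): expected throughput at a delivery ratio.

    Assumption for illustration only; the real f is not given on the slide.
    """
    return delivery_ratio * link_rate_mbps


def ett(p_avg):
    """ETT = 1 / f(p_a): uses only the average delivery ratio."""
    return 1.0 / f(p_avg)


def cett(p_avg, p_dev, load):
    """CETT = 1 / (f(p_a - p_v) * (1 - L)): penalizes variability and external load."""
    effective_ratio = p_avg - p_dev
    if effective_ratio <= 0.0 or load >= 1.0:
        return float("inf")  # link is unusable under the conservative estimate
    return 1.0 / (f(effective_ratio) * (1.0 - load))


# Two hypothetical links: (average delivery ratio, deviation, external load).
steady = (0.80, 0.05, 0.10)
variable = (0.90, 0.30, 0.10)

for name, (p_a, p_v, load) in (("steady", steady), ("variable", variable)):
    print(f"{name:8s} ETT={ett(p_a):.4f}  CETT={cett(p_a, p_v, load):.4f}")
```

With these made-up numbers, ETT prefers the variable link (higher average delivery ratio), while CETT prefers the steady one.
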

Our Indoor Testbed
• NSC Geode processors, 128 MB RAM, 1 GB Flash
• Implemented on the Click Modular Router
• Patched MadWifi 0.9.3.3

Aggregate performance
• ROMA's median aggregate throughput is 1.4X or 2.1X that of alternative designs
• ROMA is able to utilize more channels to reduce inter-path interference
[Figure: aggregate throughput (Mbps) for three designs: 2 identical channels; 1 common, 1 assigned channel; ROMA.]
Setup: 9 UDP flows from 3 gateways to non-gateway nodes

WiFi-based Long Distance Networks
• WiLD links use standard 802.11 radios
• Longer range, up to 150 km
  - Directional antennas (24 dBi)
  - Line of Sight (LOS)
• Why choose WiFi:
  - Low cost of $500/node
  - Volume manufacturing
  - No spectrum costs
  - Customizable using open-source drivers
  - Good data rates: 11 Mbps (11b), 54 Mbps (11g)

AirJaldi Network
• Tibetan Community
• WiLD links + APs
• Links 10 – 40 km
• Achieve 4 – 5 Mbps
• VoIP + Internet
• 10,000 users
Routers used: (a) Linksys WRT54GL, (b) PC Engines WRAP boards
Costs: (a) $50, (b) $140

Aravind Eye Hospital Network
• South India
• Tele-ophthalmology
• All WiLD links
• Links 1 – 15 km long
• Achieve 4 – 5 Mbps
• Video-conferencing
• 3000 consultations/month
Routers used: PC Engines WRAP boards, 266 MHz CPU, 512 MB
Cost: $140

New World Record: 382 km
Pico El Aguila, Venezuela
Elev: 4200 meters

Deployment

Overall Impact
• Both networks financially sustainable
• 50,000 patients/year being scaled to 500,000 patients/year
• Over 30,000 patients have recovered sight

Experience with WiLD Networks
• In the field, point-to-point performance is bad
• On a 60 km link in Ghana
  - We get 0.6 Mbps TCP vs 6 Mbps UDP
• On a relay (single channel)
  - We get only 2 Mbps TCP

Problem: Propagation Delay
• Large propagation delay → high collision probability
[Figure: nodes A and B at the two ends of a long-distance link.]
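
To get a feel for the numbers, the sketch below computes the one-way propagation delay for a few link lengths mentioned in this talk and expresses it in units of the 802.11b slot time (20 µs). The comparison with the slot time is an illustrative choice, not something stated on the slide: the intuition is that carrier sensing cannot detect a transmission started at the far end until the signal has arrived, so a delay spanning many slot times leaves a wide window in which both ends may start sending.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
SLOT_TIME_US = 20.0  # 802.11b (DSSS) slot time, used here only as a yardstick


def one_way_delay_us(distance_km):
    """Free-space propagation delay in microseconds for a link of given length."""
    return distance_km * 1000.0 / SPEED_OF_LIGHT_M_PER_S * 1e6


for distance_km in (1, 15, 60, 150):
    delay = one_way_delay_us(distance_km)
    print(f"{distance_km:>4} km: {delay:7.1f} us one-way "
          f"= {delay / SLOT_TIME_US:5.1f} slot times")
```
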

Design Choices for WiLDNet
• Use sliding window flow control
  - 802.11 MAC ACKs disabled
  - Packet batches sent every slot
  - Slot allocation determined by demand
• Replace CSMA with TDMA on every link
  - Alternate send and receive slots
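
The alternating send and receive slots on a single link can be sketched as follows. This is a toy model under stated assumptions, not the WiLDNet implementation: the endpoint names, the fixed batch size, and the strict even/odd alternation are all hypothetical, and loss, acknowledgments, and demand-based slot allocation are left out.

```python
from collections import deque

BATCH_SIZE = 4  # hypothetical number of packets sent per slot


def run_tdma_link(queue_a, queue_b, num_slots):
    """Alternate send/receive slots between endpoints A and B.

    Even slots: A sends and B receives. Odd slots: B sends and A receives.
    Returns a list of (slot, sender, packets_sent) tuples.
    """
    queues = {"A": deque(queue_a), "B": deque(queue_b)}
    trace = []
    for slot in range(num_slots):
        sender = "A" if slot % 2 == 0 else "B"
        pending = queues[sender]
        # Send one batch: up to BATCH_SIZE packets, no per-packet MAC ACKs.
        batch = [pending.popleft() for _ in range(min(BATCH_SIZE, len(pending)))]
        trace.append((slot, sender, batch))
    return trace


trace = run_tdma_link(range(6), ["x", "y"], num_slots=4)
for slot, sender, batch in trace:
    print(slot, sender, batch)
```

Because only one endpoint transmits in any slot, the two ends of the link never contend with each other, regardless of propagation delay.
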

Inter-Link Interference
[Figure: three panels (Simultaneous Send, Simultaneous Receive, Send & Receive), each showing nodes A, B, and C connected by links 1 and 2.]
• Disable CCA
• 12 dB isolation

Channel Loss: From external traffic
• Strong correlation between loss and external traffic
• Source (A) and interferer (I) do not hear each other
[Figure: nodes A, I, and B.]

Sustainability Challenges
• Bad quality grid power
  - Higher component failures, more downtimes
• Limited local expertise
  - Local operation, maintenance, and diagnosis difficult
• Lack of alternate connectivity
  - Complicates remote diagnosis and management
• Remote locations
  - Traveling is difficult and infrequent (often once in 6 months)

Poor Quality Power
[Figure: number of instances seen over 6 weeks, by voltage range.]
• Spikes and swells:
  - Lost 50 power adapters
  - Burned 30 PoE
• Low voltages:
  - Incomplete boots
  - HW watchdog
• Frequent fluctuations:
  - CF corruptions
  - Battery damage

HW Faults
Hardware Faults at Aravind (in 2006)

Instances* | Description | Total Downtime
63 | Router board not powered | 63 days
7 | Router powered but hung | 10 days
21 | Router powered but not connected to remote LAN (burned ethernet ports) | 34 days
3 | Router on, but wireless cards not transmitting (low voltage) | 2 days
3 | Router on, but pigtails not connected | 45 days
1 | Router on, but antenna Line-of-Sight blocked | 8 weeks

*Conservative estimate
>90% of faults are power-related

SW Faults
Software Faults at Aravind (in 2006)

Instances* | Description | Total Downtime
4 | No default gateway specified | 4 days
3 | Wrong ESSID, channel, mode | 3 days
2 | Wrong IP address | 2 days
2 | Misconfigured routing | 3 days

*Conservative estimate

Solutions
1. Power
   1.1 Low Voltage Disconnect
   1.2 Low-cost Solar Power Controller
2. Data Collection and Monitoring
3. Alternate Network Entry Points
4. Recovery Mechanisms
5. Safe Software

Power: Low Voltage Disconnect
• Low Voltage Disconnect (LVD) circuit
  - Disconnect load at low voltage
  - Prevent battery over-discharge and hung routers
• Without LVDs, roughly 50 visits per week for manual reboots at AirJaldi
• Off-the-shelf LVDs oscillate too much
  - Too many automatic reboots
• We designed a new LVD circuit with better delay
  - No more manual visits or reboots!
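
The slide does not describe the new circuit, so the following is only a software model of one common way to keep a low-voltage disconnect from oscillating: a cutoff threshold, a higher reconnect threshold (hysteresis), and a delay during which the voltage must stay above the reconnect threshold before the load is switched back on. The class name, thresholds, and delay are hypothetical values for a nominal 12 V battery.

```python
class LowVoltageDisconnect:
    """Toy model of an LVD with hysteresis and a reconnect delay."""

    def __init__(self, cutoff_v=11.0, reconnect_v=12.5, reconnect_delay_s=600):
        self.cutoff_v = cutoff_v                    # disconnect below this voltage
        self.reconnect_v = reconnect_v              # candidate for reconnect at or above this
        self.reconnect_delay_s = reconnect_delay_s  # must stay recovered this long
        self.connected = True
        self._recovered_since = None

    def update(self, voltage, now_s):
        """Feed one voltage sample; return True if the load should be powered."""
        if self.connected:
            if voltage < self.cutoff_v:
                self.connected = False
                self._recovered_since = None
        elif voltage >= self.reconnect_v:
            if self._recovered_since is None:
                self._recovered_since = now_s
            if now_s - self._recovered_since >= self.reconnect_delay_s:
                self.connected = True
        else:
            # Voltage dipped again: restart the recovery timer.
            self._recovered_since = None
        return self.connected


lvd = LowVoltageDisconnect()
samples = [(0, 12.0), (60, 10.8), (120, 12.6), (180, 12.0), (240, 12.6), (840, 12.7)]
states = [lvd.update(voltage, t) for t, voltage in samples]
print(states)
```

In this model a brief recovery (the sample at 120 s) does not reconnect the load, which is the behavior that avoids repeated automatic reboots.
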

Power: Low-cost Solar Power Controller
• Tackle spikes and swells, and enable power at remote sites
• Features
  - PPT (peak power tracking) => 15% more power draw
  - LVD + trickle charging => doubles battery life
  - Voltage regulator => no spikes and swells
  - Power-over-Ethernet => remote management
• $70 (compared to $300 for commercial units)
• Have not lost any routers yet in 1 year

Operational Results
[Figure: bar chart "Fault Incident Counts" (y-axis: Count, 0 to 60; x-axis: Incidents), comparing Before (Jan 07 - Jun 07) with After (Jul 07 - Dec 07) for: Weekly Manual Reboots (AirJaldi); Number of Power-related Router Faults (Aravind); Prolonged Downtimes greater than 1 day (Aravind); CF Card Corruptions (Aravind).]

Operational Results
[Figure: "Migration at Aravind". Rows: Maintenance, Management, Installation, Equipment Supply. Columns: Jan'06 – Jun'06, Jul'06 – Dec'06, Jan'07 – Jun'07, Jun'07 – Dec'07. Legend: Our support, Aravind, Local Vendor.]
2007: 5 more

WiRE Architecture
Questions?
Thank you!