Challenges managing large-scale wireless networks
Lakshminarayanan Subramanian
Courant Institute of Mathematical Sciences, New York University
Joint work with many others

Management Complexity Ladder
• Indoor wireless access point network
• Multi-hop indoor wireless networks
• Outdoor mesh networks
• Rural wireless networks (long-distance + mesh)

Why is management hard?
Potential causes for performance degradation
• External problems
    • Network performance issues
    • Radio separation issues
    • Unreliable power (huge problem in rural wireless)
• Software issues + configuration
    • Incorrect ETX, channel assignment
    • Routing problems
• Physical issues
    • Interference, channel fluctuations
    • Forwarding problems, unexpected packet drops
• Mundane problems
    • Loose pigtail, card misbehavior, card stops working

Why is it hard to fix?
• Potential causes are huge and interdependent
• No back channels (in multi-hop cases)
• Measurements vary by the second
    • Environmental fluctuations
    • Power fluctuations
• Software behavior on wireless boards is not very predictable
• Climbing street poles and towers is actually not fun!

Some experiences
• ROMA: Multi-radio indoor wireless network (Aditya, Jinyang)
• CitySense: Outdoor wireless mesh network (Matt T, Matt W)
• WiLDNet: Long-distance WiFi networks (Rabin, Sergiu, Sonesh, Eric, Manuel)
• WiRE architecture (Matt T, Aditya)

Multi-radio mesh promises greater throughput
• Eliminate intra-path interference
• Reduce inter-path interference
[Figure: paths to gateways; adjacent links sharing a channel are marked "Cannot transmit concurrently"]

Challenges
• Physical constraints: compact nodes, so few radios per node
• Link losses, link variability, external load

Link variabilities
• Two radios report different channel conditions
• ETX measurements are skewed
• Channel 1 works very poorly, channel 11 works well!
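To make the per-channel skew concrete, here is a minimal sketch using the commonly used ETX definition, ETX = 1 / (d_f × d_r), where d_f and d_r are the forward and reverse probe delivery ratios. The slides give neither this formula nor any measured ratios, so the function name `etx` and the delivery ratios below are assumptions for illustration only.

```python
def etx(forward_ratio: float, reverse_ratio: float) -> float:
    """Expected transmission count of one link under the common ETX definition."""
    for ratio in (forward_ratio, reverse_ratio):
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("delivery ratios must lie in [0, 1]")
    if forward_ratio == 0.0 or reverse_ratio == 0.0:
        return float("inf")  # no probes delivered in one direction: unusable link
    return 1.0 / (forward_ratio * reverse_ratio)


# Hypothetical (forward, reverse) probe delivery ratios between the same two
# nodes, measured by two radios on different channels. Not data from the talk.
probes = {
    1: (0.40, 0.50),   # channel 1: lossy
    11: (0.95, 0.90),  # channel 11: clean
}

per_channel_etx = {ch: etx(df, dr) for ch, (df, dr) in probes.items()}
best_channel = min(per_channel_etx, key=per_channel_etx.get)

for ch, value in sorted(per_channel_etx.items()):
    print(f"channel {ch}: ETX = {value:.2f}")
print(f"lowest-ETX channel: {best_channel}")
```

With these made-up ratios the same physical link looks several times more expensive on one channel than on the other, so a route metric that samples only one radio can badly misjudge the link.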
ROMA: basic idea
• Each radio in a multi-radio gateway acts as an independent gateway
[Figure: channel assignment along a path from a single-radio gateway <C1>; successive nodes are labeled <C1,C2>, <C2,C3>, <C3,C4>, and the links between them use channels C1, C2, C3]

Routing metric must consider worst link
• Single-radio route metric: $\frac{1}{\sum_i \mathrm{ETT}_i}$
• Path throughput is limited by worst link
• Multi-radio route metric: $\frac{1}{\max_i \mathrm{ETT}_i}$

Link metric
• ETT over-estimates link performance
• Link metric should incorporate:
    • Link variability: highly variable links result in unpredictable throughput
    • External load
• $\mathrm{ETT} = \frac{1}{f(p_a)}$
• Conservative metric: $\mathrm{CETT} = \frac{1}{f(p_a - p_v)\,(1 - L)}$
    • $p_a$: average delivery ratio
    • $p_v$: deviation of delivery ratio
    • $L$: fraction of time channel is busy with external traffic

Our Indoor Testbed
• NSC Geode processors, 128MB RAM, 1GB Flash
• Implemented on the Click Modular Router
• Patched Madwifi 0.9.3.3

Aggregate performance
• ROMA's median aggregate throughput is 1.4X or 2.1X of alternative designs
• ROMA is able to utilize more channels to reduce inter-path interference
• Setup: 9 UDP flows from 3 gateways to non-gateway nodes
[Figure: aggregate throughput (Mbps) for three designs: 2 identical channels; 1 common, 1 assigned channel; ROMA]

WiFi-based Long Distance Networks
• WiLD links use standard 802.11 radios
• Longer range, up to 150 km
    • Directional antennas (24dBi)
    • Line of Sight (LOS)
• Why choose WiFi:
    • Low cost of $500/node
        • Volume manufacturing
        • No spectrum costs
    • Customizable using open-source drivers
    • Good data rates: 11Mbps (11b), 54Mbps (11g)

AirJaldi Network
• Tibetan community
• WiLD links + APs
• Links 10-40 km
• Achieve 4-5 Mbps
• VoIP + Internet
• 10,000 users
Routers used: (a) Linksys WRT54GL, (b) PC Engines Wrap boards
Costs: (a) $50, (b) $140

Aravind Eye Hospital Network
• South India
• Tele-ophthalmology
• All WiLD links
• Links 1-15 km long
• Achieve 4-5 Mbps
• Video-conferencing
• 3000 consultations/month
Routers used: PC Engines Wrap boards, 266 MHz CPU, 512 MB
Cost: $140

New World Record – 382 km
Pico El Aguila, Venezuela
Elev: 4200 meters

Deployment

Overall Impact
• Both networks financially sustainable
• 50000 patients/year, being scaled to 500000 patients/year
• Over 30000 patients have recovered sight

Experience with WiLD Networks
• In the field, point-to-point performance is bad
• On a 60km link in Ghana
    • We get 0.6 Mbps TCP vs 6 Mbps UDP
• On a relay (single channel)
    • We get only 2 Mbps TCP

Problem: Propagation Delay
• Large propagation delay => high collision probability
[Figure: long-distance link between nodes A and B]

Design Choices for WiLDNet
• Use sliding-window flow control
• 802.11 MAC ACKs disabled
• Packet batches sent every slot
• Slot allocation determined by demand
• Replace CSMA with TDMA on every link
• Alternate send and receive slots

Inter-Link Interference
[Figure: three panels, "Simultaneous Send", "Simultaneous Receive", and "Send & Receive", each showing node A with radios 1 and 2 linked to nodes B and C; annotations: "Disable CCA", "12dB isolation"]

Channel Loss: From external traffic
• Strong correlation between loss and external traffic
• Source (A) and interferer (I) do not hear each other
[Figure: source A, interferer I, receiver B]

Sustainability Challenges
• Bad quality grid power
    • Higher component failures, more downtimes
• Limited local expertise
    • Local operation, maintenance, and diagnosis difficult
• Lack of alternate connectivity
    • Complicates remote diagnosis and management
• Remote locations
    • Traveling is difficult and infrequent (often once in 6 months)

Poor Quality Power
[Figure: number of instances seen over 6 weeks, by voltage range]
• Spikes and swells:
    • Lost 50 power adapters
    • Burned 30 PoE
• Low voltages:
    • Incomplete boots
    • HW watchdog
• Frequent fluctuations:
    • CF corruptions
    • Battery damage

HW Faults
Hardware Faults at Aravind (in 2006)

Instances*  Description                                                              Total downtime
63          Router board not powered                                                 63 days
7           Router powered but hung                                                  10 days
21          Router powered but not connected to remote LAN (burned ethernet ports)  34 days
3           Router on, but wireless cards not transmitting (low voltage)             2 days
3           Router on, but pigtails not connected                                    45 days
1           Router on, but antenna line-of-sight blocked                             8 weeks

*Conservative estimate
>90% of faults are power-related

SW Faults
Software Faults at
Aravind (in 2006)

Instances*  Description                    Total downtime
4           No default gateway specified   4 days
3           Wrong ESSID, channel, mode     3 days
2           Wrong IP address               2 days
2           Misconfigured routing          3 days

*Conservative estimate

Solutions
1. Power
    1.1 Low Voltage Disconnect
    1.2 Low-cost Solar Power Controller
2. Data Collection and Monitoring
3. Alternate Network Entry Points
4. Recovery Mechanisms
5. Safe Software

Power: Low Voltage Disconnect
• Low Voltage Disconnect circuit (LVD)
    • Disconnect load at low voltage
    • Prevent battery over-discharge and hung routers
• Without LVDs, roughly 50 visits per week for manual reboots at AirJaldi
• Off-the-shelf LVDs oscillate too much
    • Too many automatic reboots
• We designed a new LVD circuit with better delay
    • No more manual visits or reboots!

Power: Low-cost Solar Power Controller
• Tackle spikes and swells, and enable power at remote sites
• Features
    • PPT (peak power tracking) => 15% more power draw
    • LVD + trickle charging => doubles battery life
    • Voltage regulator => no spikes and swells
    • Power-over-Ethernet => remote management
• $70 (compared to $300 commercial units)
• Have not lost any routers yet in 1 year

Operational Results
[Figure: fault incident counts (0-60), before (Jan 07 - Jun 07) vs. after (Jul 07 - Dec 07), for four incident types: weekly manual reboots (AirJaldi); number of power-related router faults (Aravind); prolonged downtimes greater than 1 day (Aravind); CF card corruptions (Aravind)]

Operational Results
Support Migration at Aravind
[Figure: who handles maintenance, management, installation, and equipment supply (our support, Aravind, or local vendor) over four periods: Jan’06 – Jun’06, Jul’06 – Dec’06, Jan’07 – Jun’07, Jun’07 – Dec’07]
2007: 5 more

WiRE Architecture

Questions?
Thank you!