Experimenting with Cumulus Linux

This post was originally published as a Twitter thread on 4.3.2021.

For the last few weeks I’ve been diving into Cumulus Linux to test how it is configured and operated. The goal is to make a tutorial for Cisco network engineers.

Here’s a thread on what I noticed during my experimentation.

There used to be over 130 different hardware platforms supported. After the Nvidia merger, Broadcom switches were dropped last fall. Now the hardware is Mellanox Spectrum only, which limits the options radically. The good thing is that Cumulus and Mellanox now come from the same house, nicely bundled.

Cumulus 4.2 is the last version to support Broadcom chips. Cumulus is now at version 4.3, so existing Broadcom switches don’t have many years left to run. Long-term ESR releases get 3 years of support before EOL.

Hardware affects many advanced configurations like ACLs, QoS, DDoS protection, TCAM and HW profiles. There are some notable differences between Broadcom and Mellanox. You can also run Cumulus on any common server.

The Cumulus VX virtual appliance is freely available for testing but is not intended for production use. I ran two VMs on my Windows laptop using VirtualBox. You can also test Cumulus in the cloud with a full fabric setup and NetQ: https://air.cumulusnetworks.com/.

Cumulus is a Debian-based Linux, and it can be configured like native Linux or through the NCLU command wrapper. NCLU is good for most configurations and is a familiar way of working for network engineers. There are additional Python-based cl-tools to operate certain features.

NCLU is like a poor man’s Junos: simple but feature-rich enough. The similar Unix background shines through. NCLU has tab completion, built-in help with config examples, and commit and rollback with version control.

NCLU uses add/del commands to edit the config. The commands are reorganized into config file stanzas in a different format. NCLU can show the config as add statements, and this snippet can be copied and pasted into Notepad or onto another device. The config can also be viewed as JSON.

You can mix and match native Linux and NCLU as you like. This is a very powerful option for building your own style of configuration and workflow. Operational commands like ping, traceroute and tcpdump are native Linux tools. You can also install more Linux tools of your choice.

Commit is a great feature, but there were some dependencies that prevented entering all commands at once, so I had to configure something and commit it before I could add more config. You can check the pending config, and commit automatically shows the diff and the entered commands.
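The stage-then-commit workflow described above can be sketched as a small Python helper that turns batches of NCLU commands into one ordered plan, with a `net pending` review and a `net commit` after each batch. The helper name and the two example batches are hypothetical; the post does not say which commands needed a separate commit.

```python
from typing import Iterable, List


def staged_commit_plan(batches: Iterable[List[str]]) -> List[str]:
    """Flatten command batches into one sequence, committing after each batch."""
    plan: List[str] = []
    for batch in batches:
        if not batch:
            continue  # nothing staged, so no commit is needed
        plan.extend(batch)
        plan.append("net pending")  # review the staged changes
        plan.append("net commit")   # apply them before the next batch
    return plan


if __name__ == "__main__":
    # Hypothetical dependency: create the bond first, then configure on top of it.
    example = [
        ["net add bond bond0 bond slaves swp1-2"],
        ["net add bond bond0 bridge access 10"],
    ]
    for line in staged_commit_plan(example):
        print(line)
```

The printed plan can be pasted into the CLI one batch at a time, which keeps each commit small and makes it obvious which batch a failure belongs to.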

Commit confirmed is strange. The prompt waits forever for Enter to confirm the commit. Meanwhile you can’t do anything else on the CLI, such as checks and validations. The confirmation prompt doesn’t disappear even after the timer has run out, so you don’t know when the config has rolled back.

Versioning rotates through 30 files and allows permanent configs. With rollback it’s a bit hard to find the right version and check what changes were made earlier. You can add comments to commits, and standard Linux diff shows the changes. Rollback commits the config automatically.
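If you copy two config versions off the box, the same kind of comparison that Linux diff gives can be sketched in Python with the standard difflib module. This is only an illustration: the function name is made up, and the sample config text is a hypothetical /etc/network/interfaces fragment, not output from a real switch.

```python
import difflib
from typing import List


def config_diff(old: str, new: str,
                old_name: str = "before", new_name: str = "after") -> List[str]:
    """Return a unified diff between two config snapshots as a list of lines."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm="",
    ))


if __name__ == "__main__":
    # Hypothetical snapshots for illustration only.
    before = "auto swp1\niface swp1\n"
    after = "auto swp1\niface swp1\n    address 10.0.0.1/24\n"
    print("\n".join(config_diff(before, after)))
```

An empty result means the two versions are identical, which is a quick way to rule out candidates when hunting for the right rollback point.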

The documentation is very good, simple and clearly organized, and it’s open without login. Some of the more advanced features lack detailed explanations and instructions, but you can always Google common Linux guides.

Interfaces are always named swp1-swpX. The mgmt port eth0 is dedicated to its own MGMT VRF. You can use IF classes for sharing common functions. There are some minor syntax inconsistencies between configuring physical interfaces, bonds and VLANs.
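Because the naming is so regular, port ranges are easy to handle in scripts. Here is a minimal Python sketch that expands range notation such as swp1-4 into individual port names. The function is hypothetical and deliberately simple: it assumes plain names like swp12 and does not handle breakout ports.

```python
import re
from typing import List

# A name prefix, a start number, and an optional "-end" number.
_RANGE = re.compile(r"^([A-Za-z]+)(\d+)(?:-(\d+))?$")


def expand_ports(spec: str) -> List[str]:
    """Expand a spec like "swp1-3,swp5" into ["swp1", "swp2", "swp3", "swp5"]."""
    ports: List[str] = []
    for part in spec.split(","):
        match = _RANGE.match(part.strip())
        if match is None:
            raise ValueError(f"unsupported port spec: {part!r}")
        prefix = match.group(1)
        start = int(match.group(2))
        end = int(match.group(3) or match.group(2))
        if end < start:
            raise ValueError(f"descending range: {part!r}")
        ports.extend(f"{prefix}{n}" for n in range(start, end + 1))
    return ports
```

For example, `expand_ports("swp1-3,swp5")` gives one name per port, which is handy when you need to loop over interfaces for checks or templating.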

I tested basic L2 and VLAN routing. The more scalable single-instance VLAN-aware bridge is the way to go if you don’t need traditional per-VLAN spanning tree. The MAC table aging time is an unusual 1800 s, and the ARP timeout is 18 min.
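To give an idea of what a VLAN-aware bridge with routed VLANs looks like in NCLU, here is a small Python sketch that generates the add statements. The function name, port range, VLAN IDs and addresses are made-up examples, and the command forms are the ones I believe NCLU uses, so check them against the documentation for your release.

```python
from typing import Dict, List, Optional


def vlan_aware_bridge_commands(ports: str,
                               vlans: Dict[int, Optional[str]]) -> List[str]:
    """Build NCLU add statements for one VLAN-aware bridge.

    ports: member ports in NCLU notation, e.g. "swp1-4"
    vlans: VLAN ID -> SVI address in CIDR form, or None for an L2-only VLAN
    """
    for vid in vlans:
        if not 1 <= vid <= 4094:
            raise ValueError(f"invalid VLAN ID: {vid}")
    vids = ",".join(str(vid) for vid in sorted(vlans))
    commands = [
        f"net add bridge bridge ports {ports}",
        f"net add bridge bridge vids {vids}",
    ]
    for vid in sorted(vlans):
        address = vlans[vid]
        if address:  # only routed VLANs get an SVI address
            commands.append(f"net add vlan {vid} ip address {address}")
    return commands


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    for cmd in vlan_aware_bridge_commands("swp1-4", {10: "10.0.10.1/24", 20: None}):
        print(cmd)
```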

VLAN VRRP config between two devices was a bit quirky: I couldn’t commit the priority and preempt commands. FRR is used for routing protocols, and Zebra programs routes into the kernel. VXLAN routing and PIM-SM are supported; SR and GRE are in early access.

ACLs use Linux Netfilter and support only input and forward chain filtering and mangling. ACLs can be configured using NCLU, iptables or cl-acltool. The NCLU ACL editor has row numbers like Cisco. The default policy is allow. cl-acltool took a very long time to collect ACL rules.
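As a rough illustration of the iptables-style rules that cl-acltool works with, the Python sketch below renders a rules file with an [iptables] section and rejects any chain other than INPUT and FORWARD, following the limitation noted above. The function name and the sample rule are hypothetical. As far as I recall, such files live under /etc/cumulus/acl/policy.d/ and are installed with cl-acltool -i, but treat both as assumptions and verify them in the documentation.

```python
from typing import Iterable, Tuple

# Chains limited to the two this post says are supported.
ALLOWED_CHAINS = {"INPUT", "FORWARD"}

# (chain, iptables match expression, target)
Rule = Tuple[str, str, str]


def render_acl_policy(rules: Iterable[Rule]) -> str:
    """Render rules as the text of an iptables-style policy file."""
    lines = ["[iptables]"]
    for chain, match, target in rules:
        if chain not in ALLOWED_CHAINS:
            raise ValueError(f"unsupported chain: {chain}")
        lines.append(f"-A {chain} {match} -j {target}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical rule: drop SSH arriving on swp1 and passing through the switch.
    print(render_acl_policy([
        ("FORWARD", "--in-interface swp1 -p tcp --dport 22", "DROP"),
    ]), end="")
```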

ACLs can include policers, SPAN/ERSPAN, log and QoS marking actions. Logging was lacking as far as I tried, and it is supported only for the drop action anyway. There is a predefined control-plane protection policer in place. A DDoS checker is built in but not supported on Mellanox.

ACLs with Netfilter/iptables are one of the hardest features to adopt. QoS was even harder and too confusing for me to try and understand.

PTM is a nice tool for LLDP topology monitoring. It can detect anomalies and take scripted actions. Note that LLDP is not real-time because it has a long hold time. You can also visualize the .dot topology file as a graph. Logging uses standard Linux files and facilities.
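The idea behind PTM, comparing the cabling you planned with what LLDP actually sees, can be sketched in a few lines of Python. This is not how PTM itself is implemented. The functions are hypothetical, the parser only understands link lines of the form "host":"port" -- "host":"port"; and the observed neighbors are passed in as a plain dict instead of being read from LLDP.

```python
import re
from typing import Dict, Optional, Tuple

Neighbor = Tuple[str, str]  # (remote host, remote port)

# Matches one link line such as: "leaf01":"swp1" -- "spine01":"swp1";
_LINK = re.compile(r'"([^"]+)":"([^"]+)"\s*--\s*"([^"]+)":"([^"]+)"')


def expected_neighbors(dot_text: str, host: str) -> Dict[str, Neighbor]:
    """Map each local port of `host` to the neighbor the topology file expects."""
    expected: Dict[str, Neighbor] = {}
    for a_host, a_port, b_host, b_port in _LINK.findall(dot_text):
        if a_host == host:
            expected[a_port] = (b_host, b_port)
        if b_host == host:
            expected[b_port] = (a_host, a_port)
    return expected


def cabling_mismatches(
    expected: Dict[str, Neighbor],
    observed: Dict[str, Neighbor],
) -> Dict[str, Tuple[Neighbor, Optional[Neighbor]]]:
    """Return ports whose observed neighbor differs from the expected one."""
    return {
        port: (want, observed.get(port))
        for port, want in expected.items()
        if observed.get(port) != want
    }
```

A port that is missing from the observed data shows up with None as its actual neighbor, so unplugged cables and miscabled ports are reported the same way.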

Mellanox switches have the What Just Happened (WJH) feature to collect streaming telemetry data from the ASIC. The asic-monitor feature can collect hardware stats and snapshots to files.

Package upgrades between minor versions can be done with apt-get directly from the public repo, which is very handy. Image install is the hard way because it wipes the whole file system, and you have to back up and restore configuration files manually.
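The manual backup before an image install could be scripted along these lines. This is a minimal sketch using Python’s tarfile module. The function name is made up, and the default file list is my assumption of a sensible starting point, not an official or complete list, so extend it to cover the features you actually use.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List

# Assumed starting list of config files; adjust for your own setup.
DEFAULT_PATHS = [
    "/etc/network/interfaces",
    "/etc/frr/frr.conf",
    "/etc/frr/daemons",
    "/etc/hostname",
]


def backup_configs(paths: Iterable[str], archive: str) -> List[str]:
    """Write the existing files from `paths` into a gzip-compressed tar archive.

    Returns the paths that were saved; missing files are skipped silently.
    """
    saved: List[str] = []
    with tarfile.open(archive, "w:gz") as tar:
        for path in map(Path, paths):
            if path.exists():
                tar.add(str(path))  # stored relative to "/", so it restores in place
                saved.append(str(path))
    return saved
```

Calling `backup_configs(DEFAULT_PATHS, "/tmp/config-backup.tar.gz")` on the switch and copying the archive off the box before the install leaves you with something to restore from afterwards. The archive path here is only an example.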

With the ONIE bootloader you can download and install Cumulus using DHCP and a web server. USB install is also possible.

NetQ is one of the best mgmt tools I’ve seen, simple but powerful. Besides basic device monitoring and config templates, it can show topology and WJH telemetry data, run connectivity traces, validate IF, protocol and HW parameters, and alert on threshold violations.

Cumulus offers many great features and tools for powerful network operations of your choice. But the disaggregation model has its weaknesses, such as hardware dependencies and inconsistencies. The Broadcom incident casts uncertainty over the whole disaggregation model.

Cumulus would be best for use cases where a basic feature set is enough. It is then easy to configure, operate and maintain. Additional monitoring capabilities and tools make Cumulus even more compelling.
