Eric Chou (0:05 – 0:28)

Hello and welcome to the Network Automation Nerds podcast, where we explore the latest in network automation from a practitioner’s perspective. I’m your host, Eric Chou, a network engineer who loves everything about network automation. Today, we’re talking to Greg Botts from Intel, who transformed 5,000 plus network devices across 56 data centers with a small team. Greg started with YAML files and DNS records and ended with a scalable data center design that is the perfect foundation for AI workloads. In the process, he discovered that although not for everyone, sometimes open source solutions are better than commercial ones. This episode is sponsored by Network To Code, and I’m joined today by my co-host, Ethan Banks. Let’s dive in. Welcome to the show, Greg.

Greg Botts (00:29)

Thank you. Thank you for having me.

Eric Chou (0:30 – 1:15)

Yeah, Greg. So before we start, one question that’s kind of been burning in my mind, even when we initially talked, right? Give us some context about your role at Intel and what actually drives the massive network infrastructure demand, because Intel to me has always been just manufacturing, right? I go to Best Buy, buy my computer, and there’s Intel Inside, right? So what’s driving these massive network demands for Intel?

Greg Botts (1:16 – 3:27)

That is a great question, and I think I’m going to answer it kind of backwards. So I’ll explain first kind of how we categorize our data center infrastructures. There’s basically four different kinds of disparate infrastructures, and maybe they live in the same data center. Some places, maybe they don’t. We have data centers worldwide. But the four kind of categories, we have an acronym for it. We love acronyms. So our acronym is DOME, D-O-M-E. D is for design, which I’ll get into in a second. O is office, M is manufacturing, which you were talking about, and E is enterprise. So my realm is the D and the E, so design and enterprise. Design is kind of what it sounds like. It’s where all the chip design happens, right? That’s all the software. I have no idea what our customers do. They’re brilliant, and that’s where they’re doing all of the things that go into design, right? Even regression testing, all that kind of stuff. So for us, that’s very heavy compute, very heavy storage. That’s where we have our scale at the data center. They just beat the tar out of our stuff. The E, enterprise, that’s kind of where we get a little more customized, a little more complex. Like, we host our supply chain systems. We have a lot of on-prem hosting solutions that are actually hosted in the enterprise infrastructure. Tons of web applications, that sort of thing. So again, for us, a lot of customization, a lot of security, a lot of different solutions we have to implement. A little bit more complex, not the scale of the D. So again, for me, I’m kind of a senior network engineer for both those environments. Like you said, if you total both of those, we run the same platform underneath everything. We’re at like 5,500 network devices. So I do that, and then I’m also on this little small team that helps automate against both the D and the E.

Eric Chou (3:27 – 4:41)

Yeah, I love acronyms too. So I’m glad you broke it down into just D-O-M-E. And I remember in a previous life, I was kind of in the same role, just managing both, but they’re just so different, right? Because in a previous life, we had the hyperscaler, the cloud, and the office, the enterprise. And they almost diverge at the very beginning. Almost like when you’re writing, there’s fiction and nonfiction. Because the enterprise tends to be very wide. So they need to do a little bit of wireless. They need to do a little bit of wired networking, firewalls, and all of that. That’s very tailor-made, even at the office level. But on the other side, with the hyperscalers, they almost have to be concerned with just standardization. They need to operate at massive scale. At scale, everything breaks. So the fundamental needs are almost diverging. So how often do you find yourself kind of splitting between the two sides and switching hats? And has that been difficult for you? Or does that go into your background, right? Like, how did you come to Intel? And what was your previous experience like?

Greg Botts (4:41 – 5:35)

Right. So I think, all total, I’ve been here with Intel the better part of 15 years. Oh, wow. I actually grew up, you know, computer science in school. First job right out of college was a Linux sysadmin, or Unix sysadmin, which turned into Linux. Having that background, by the way, I think that’s how everyone should start. Whenever anyone wants to get into anything IT, even if you want to be a developer, if you can get a little job as a Linux admin, if they even still have those anymore, that would be my recommendation. Anyway, I was more on the server side, doing a little Bash scripting here and there. What was the one that gave you a response back? And you had to, gosh, it’s old. I can’t even remember it. Thank you.

Ethan Banks (5:36)

Expect.

Greg Botts (5:36 – 6:22)

Yeah. It was appropriately named. Anyway, so I joined Intel. I still wasn’t even a network guy. I ended up morphing into a network guy and I started on the E side of things. So I kind of grew up in that environment, left Intel for a couple of years, came back and came back more into the D side of it. So now I kind of have both. And where we’re going, and I’m sure we’ll get to this in a little bit, with our new kind of automation system and some of our new standards, we’re trying as much as we can to kind of blend those together. We, you know, staffing is always short, right? And thin. So we really want, you know, the same network engineers working on both environments. And we’ve come a long way in that regard.

Ethan Banks (6:23 – 6:31)

You mentioned customers early on. And I’m assuming your customers, Greg, from your perspective, are internal Intel folks that use your network to do what they do.

Greg Botts (6:31 – 6:51)

Yes. Up until recently, in the D space, now we’re starting to get some external customers. And so we’ve had to kind of put some overlay on that infrastructure so we can isolate workloads and be secure. We don’t personally, right? We’re more on the infrastructure side. So the folks I interact with, yes, 100% internal.

Ethan Banks (6:52 – 6:58)

And then you also said 5,500 network devices. I’m assuming that’s multi-vendor, perhaps with a lot of diversity?

Greg Botts (6:58 – 7:16)

We’re always kind of, it seems like, in between a migration, right? And so we’re at the tail end of getting the last platform out. And those 5,500 are all, you know, varying SKUs, right? But same vendor, same platform. That’s our ideal state.

Ethan Banks (7:17 – 7:19)

Same vendor, same platform with common programmatic interfaces you’re accessing?

Greg Botts (7:19)

Yes. Yes.

Ethan Banks (7:20)

Lucky.

Eric Chou (7:21)

Yeah, I know, right?

Eric Chou (7:26 – 7:28)

You owe me a dollar there, Greg.

Greg Botts (7:29 – 8:06)

That was one of the coolest things when we did transition. You know, it was probably six or seven years ago. So it was before COVID, BC. We started bringing in a new network platform. And for us, the game changer was, you know, yes, they had awesome, cool interfaces and, you know, nice software that went with it. But for us now, every device came with its own API server. And that was, for us, a game changer. The previous platform, I mean, we had automation, but it was a lot of, you know, screen scraping and the automation had to be a lot more complex. So now we’ve got 5,500 APIs out there, which really, like I said, that was a game changer.
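The shift Greg describes, from screen scraping to a per-device API server, can be sketched like this: instead of parsing terminal output with regexes, you send a structured request and get structured data back. The JSON-RPC shape below is a generic illustration of what many switch-OS API servers accept; the method and field names are assumptions, not necessarily the platform Greg is describing.

```python
# Build a JSON-RPC payload asking a device's API server to run one command.
# The "runCmds" method and field names are illustrative; check your
# platform's API documentation for the real shape.
import json

def build_rpc(command: str, request_id: int = 1) -> str:
    """Serialize a single-command request for a device API server."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "runCmds",
        "params": {"version": 1, "cmds": [command], "format": "json"},
        "id": request_id,
    })

# In practice you'd POST this with urllib/requests to the device's HTTPS
# endpoint and parse the JSON response, instead of regex-scraping a terminal.
payload = build_rpc("show version")
print(json.loads(payload)["params"]["cmds"])  # -> ['show version']
```

The key difference from screen scraping is that the response comes back as structured JSON, so the automation logic stays simple even across software versions.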

Eric Chou (8:07 – 9:37)

Yeah, I want to double click on that, because I think that’s kind of the foundation we move forward with. Because, you know, there’s a great paper that you guys recently published, Scaling DC with SDN. And I want to give a lot of attention to that. If you haven’t read that paper, I think it’s a very honest, very thorough paper. Sometimes when I was reading it, I almost wanted to just click into it, right? Like, I want to speak to the person who wrote that part, which probably applies somewhat to you, right? You were a co-author of that paper. And I think that went into a lot of your, you mentioned, six-year journey, from, you know, the 1.0, the orchestration, the automation, picking the vendor, going through that process of build or buy. So besides just pointing to that paper, for people who are interested, we’ll include it in the show notes. But could you just walk us through the automation part from the very beginning? Because, you know, it went through 1.0, 2.0, the design changes, you know, aggregation and interconnecting with, you know, quote unquote classic or legacy devices. That is a huge amount of tasks, and that paper actually went through the whole thing. So just walk us through the automation part, because we don’t have days to summarize the six-year journey. What was that like, if you take us to the beginning?

Greg Botts (9:38 – 11:46)

Absolutely. So this was six to seven years ago, the BC, you know, time for a network refresh, right? And so we had our platform, we had our stuff. At the same time, right, we were figuring out, like, what is our network design going to look like? Now, we were going to do a leaf-spine design. So we’re figuring out what that looks like. We’ve got a new platform, so it’s a new OS we’re dealing with, you know. And at the same time, hey, we want to automate all that stuff, right? So all of that’s happening at once. In hindsight, that’s not ideal, it turns out, unless you have a really huge team. But anyway, so to start, right, we had to go quickly. Our vendor had a really slick turnkey solution, really around provisioning, and we did leverage that. And that was awesome. We started with that. It needed to be fed some data. So, you know, this is the start of this evolution that goes, you know, for several years, right? It needs some data. So like you alluded to, we started out literally with a YAML file, and then there was some part of the provisioning process where we had to do kind of a dynamic association between serial number and hostname. And so we ended up using DNS text records. And that was our data, right? That was our version 0.1 of a network source of truth, it turns out. So anyway, we start there, right? Our enterprise, you know, it’s working in the lab, right? We start rolling out some enterprise boxes. The system’s working. The design side starts going, and that’s the scale side. And once that got going, the D, we outgrew that, I mean, in about a week, right? It was just not scalable. You know, you’ve got this 10,000-line YAML file, there’s a syntax error, there’s an extra space somewhere in there, right? And you’ve got to go deal with it. So not scalable. So we pretty quickly evolved and took our little YAML file and DNS text records and put that into a database, a proper database.
We had to come up with the schema and that sort of thing. Then we needed a render engine, which I call a Rengin.
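The serial-to-hostname association Greg mentions can be sketched as follows: during provisioning, a booting device knows only its serial number, and a DNS TXT record maps that serial to its intended hostname. The record format and field names here are illustrative assumptions, not Intel’s actual schema; in production the strings would come from a DNS query (e.g. dnspython’s `dns.resolver.resolve(name, "TXT")`).

```python
# Hypothetical TXT-record scheme: each record is a string like
# "serial=<SN> hostname=<name>", and provisioning looks up the hostname
# a booting device should claim from its serial number.

def parse_txt_records(records):
    """Parse TXT strings like 'serial=ABC hostname=leaf-01' into a map."""
    serial_to_host = {}
    for record in records:
        fields = dict(kv.split("=", 1) for kv in record.split())
        if "serial" in fields and "hostname" in fields:
            serial_to_host[fields["serial"]] = fields["hostname"]
    return serial_to_host

def hostname_for_serial(serial, records):
    """Return the hostname for a given device serial, or None if unknown."""
    return parse_txt_records(records).get(serial)

# Stand-in for records fetched from DNS during zero-touch provisioning.
records = [
    "serial=FDO1234ABCD hostname=leaf-01",
    "serial=FDO5678EFGH hostname=leaf-02",
]
print(hostname_for_serial("FDO1234ABCD", records))  # -> leaf-01
```

This also illustrates why the approach tops out quickly: the “database” is a flat list of strings with no schema enforcement, which is exactly the pain Greg describes hitting at D-side scale.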

Eric Chou (11:48 – 11:51)

Did you copyright that or the Rengin?

Greg Botts (11:51 – 14:12)

It’s really like, I couldn’t, I kept saying it too fast. It came out Renjin. So we made this Rengin, which was basically a bunch of Python code where we had logic to enumerate the values, right, as we generate config. So, for instance, we kind of have a DHCP server for our ASNs. You boot a device, it needs to know what ASN it gets for BGP. Some of them have to be the same. Some of them have to be different. We had that whole logic, right? That’s all a bunch of Python scripts. We had VXLAN VNI numbers that we wanted to calculate. We had an algorithm. It’s a Python script; you know, feed it your device name or your VLANs, and it’ll come back and give you your VNI. We had Jinja templates to take all that data, you know, run it through our templates, and the end product is our config. Well, that Rengin needed an API server, right, for us to call as part of the workflow. So now we’ve got this whole workflow. The vendor turnkey solution is still in there, by the way. And this is all this evolution, right? Step by step by step. And I keep saying we. Everything I just described was predominantly one individual, not me. I was involved, but we had just a brilliant guy, grew up as a network engineer, started growing in Python, and just became the unicorn that you mentioned, right? You know, his horn was developing over all those years, and so now we had a full-fledged unicorn, which really enabled us to get there. And then the kind of cliffhanger that I’ll finish your question with is that all spanned about, I don’t know, four years or so. And then we kind of hit this inflection point. So the vendor turnkey solution, their roadmap was changing. And the component that we leveraged a lot for the provisioning was going away. And, you know, it wasn’t free, right? And the licensing was per device, and now we have a lot of devices. So there’s a dollar factor in there.
And our system had just become very complex, right? That database now had tons of tables, and some of them were needed for one thing, some of them were needed for another. Some of them were for the D, some were for the E. It just wasn’t very approachable for our team of network engineers to kind of participate. And then I always joked, right, now we’re one lottery ticket away from being in trouble.
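A minimal sketch of the “Rengin” idea: deterministic Python functions derive values like VXLAN VNIs and BGP ASNs from source-of-truth data, and the results are then fed through Jinja templates to produce config. The numbering scheme below (base offsets, pod-indexed ASNs) is an illustrative assumption, not Intel’s actual algorithm.

```python
# Hypothetical enumeration logic: derive VNIs and ASNs deterministically
# so every run of the render engine produces the same config from the
# same source-of-truth data. All constants here are illustrative.

VNI_BASE = 10_000   # assumed site-wide VNI offset
ASN_BASE = 65_000   # assumed start of a private-ASN range

def vni_for_vlan(vlan_id: int) -> int:
    """Map a VLAN ID to a VXLAN VNI deterministically."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"invalid VLAN {vlan_id}")
    return VNI_BASE + vlan_id

def asn_for_device(pod: int, leaf_index: int = 0) -> int:
    """Assign ASNs so some devices share one and some differ.

    Spines in a pod share the pod ASN (leaf_index 0); each leaf gets
    its own, matching the 'some same, some different' rule Greg mentions.
    """
    return ASN_BASE + pod * 100 + leaf_index

# The render step would then pass these values through Jinja2 templates,
# e.g. Template("vxlan vni {{ vni }} vlan {{ vlan }}").render(...).
print(vni_for_vlan(100))                    # -> 10100
print(asn_for_device(pod=2, leaf_index=3))  # -> 65203
```

Keeping the derivation in pure functions like these is what lets a render engine regenerate any device’s config from scratch instead of storing every computed value.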

Ethan Banks (14:14 – 14:19)

That is, your unicorn goes away. Now you’ve got this homegrown system that no one else seems to know how to maintain.

Greg Botts (14:20 – 14:24)

We kind of knew, you know, we could manage, but yeah, it would not be ideal.

Eric Chou (14:24 – 15:20)

Yeah. Yeah. That’s not a winning combination, is it? Like, it costs a lot of money, it doesn’t meet your needs, and now it’s going away too. That sounds like a crisis to me, right? Like, if we were in The Phoenix Project, that’s like the highlight. It’s going through the hero’s journey, where you’re faced with this mountain of a challenge that you need to go solve. So what happened next, right? Like, I think that’s really the burning question here. So, you know, you have this commercial solution. You have a bunch of Python scripts, which, from what you describe, have a lot of relationships between different components. When you’re rendering in real time, I bet that’s not very fast, or, you know, all the logic gets bundled into the code. And that’s not very fun to manage. And, you know, you’re relying on this unicorn person to manage it. So what happened next? Can you walk us through the next step of your evaluation process?

Greg Botts (15:21 – 15:57)

This was kind of the cool opportunity where we could, you know, step back and start from scratch, right? That’s so rare, especially in a big infrastructure, but that was our chance. We weren’t designing our new network infrastructure anymore, right? That was solid and humming. So really, we could just focus on the automation piece. So it was time to go shopping. And the first thing we bought, we actually got a DevOps guy that I had worked with previously. Fantastic hire, very skilled in DevOps. So now we’ve got, you know, a unicorn, a DevOps guy, and half of me. So we’re at two and a half now.

Eric Chou (15:57 – 15:58)

Nice.

Greg Botts (15:58 – 16:22)

Which I don’t necessarily recommend as a number. Really, there’s a lot of boxes you have to have checked, in terms of what skills you need to pull something like this off, right? And it’s a lot of varying boxes to check. And so now, between the two and a half of us, all those boxes that we needed were checked. Maybe you can find one person to check them all. Maybe it takes five. We have two incredible folks.

Eric Chou (16:22 – 16:27)

But that person wouldn’t have enough hours in the day to check all those boxes, right?

Greg Botts (16:27 – 17:18)

That is also true. Yes. So now we are going shopping, right? And we’ve got our list of criteria. And a lot of our criteria was focused around supportability. So we wanted off the shelf as much as possible, right? We knew we would probably need some customization, because we do some weird stuff, but off the shelf as much as possible. Abstraction, to the extent possible, was really good. Open standards was a big thing for us. You know, not a black box, right? So if we could find something vendor-agnostic, which in the realm of, you know, network automation, a lot of the open source stuff happens to be vendor-agnostic. So that was huge. And we wanted a light administrative burden. In our previous system, there was a lot of kind of overhead for sysadmin-type work. I found myself going back to my first job out of college, right? Being a Linux admin and all that stuff. So that was our list.

Eric Chou (17:18 – 17:27)

So do you mean the overhead, meaning the learning curve you have for that specific tool, right? Like the unique aspect of that. Is that what you meant?

Greg Botts (17:27 – 17:59)

It was more the hosting part of it, right? So a vendor solution, typically, right? They’ll sell you an appliance, maybe. Or we were kind of rolling our own. So it was a VM, but it wouldn’t work in our hosted VM farm for many reasons. So now I’m running a bunch of KVM servers. And I’m responsible for the hardware, the OS stack, the KVM, which is what I was using, and then putting their stuff on. And then it’s clusters of those. And then it’s our API server and all of the things, all of our workflow, just a hosting burden, really.

Eric Chou (17:59 – 18:19)

Right, right, right. That’s kind of surprising. But I think that makes sense once you explained it, right? Because initially, I was just thinking about the soft part of it. But yeah, a lot of times they stick into this uniqueness that only works with their thing. That combination, that winning combination that we talked about, right? So yeah, that makes a lot of sense.

Greg Botts (18:19 – 19:13)

That was a big item, I guess, on our shopping list, our criteria. So what happened was really cool. We went out to go shopping, and we realized the ecosystem around network automation tooling had just blown up since we last looked. And it was, I mean, it was a little bit overwhelming. But, you know, it’s like Steinzi, you had him on, with his map, right? We’re looking at the map and trying to sort through everything, which was fantastic, right? So we had lots of options. Now it was almost too much. So we had to kind of narrow it down. We knew in our old system, that database that our unicorn had, you know, that schema, that was kind of our bread and butter. So we decided we needed to start with a network source of truth. And honestly, it was also attractive because there weren’t a lot of choices for a network source of truth out in the ecosystem. So it was a little easier to make a decision. It’s like when I go to Costco. I know, you know, if mustard’s on the list, there’s only one mustard.

Eric Chou (19:14 – 19:14)

Right, right.

Greg Botts (19:15 – 19:29)

We wanted that. And then within a network source of truth, it was like, let’s find one that has as many features as we can get, right? Maybe it would be an extra bonus if it came with a Rengin, for instance. So that was our shopping experience.

Eric Chou (19:29 – 19:44)

I tell you, you’ve got to copyright that term. I imagine from now on, you know, whenever I see a rendering engine, I’ll just start calling it a Rengin and think of Greg. You first heard it here, right? I was X years old when I heard it here.

Greg Botts (19:46 – 19:51)

There’s probably a real word and I’m just not up to speed on it. But that’s what I call it.

Eric Chou (19:51 – 20:21)

Yeah, no, it’s great. So now you went to Costco or Trader Joe’s, where you have limited options, right? Like, there’s just a few options out there. So what tool did you ultimately decide on, and what was the reason behind it? Because people might think they’re the same. People might think they’re different. And if you talk to the vendors, everybody will tell you what’s so cool about them. But I want to hear it from your perspective. Like, what was the reason that you ultimately picked Vendor X or Tool X?

Greg Botts (20:21 – 21:10)

Yes. So we did end up open source, and we went with Nautobot. And really, it was, you know, the network source of truth stuff was there, right? Now our data, you know, all our tables, all our schema, it’s baked in there. It knows that a VLAN can belong to a VRF, and a VRF can belong to a data center, and, you know, et cetera. So that was all there. One of the game changers, right, we also wanted more off-the-shelf features, you know, as many as possible. And so there’s a, I don’t know if it’s called an app or a plugin, I don’t know the nomenclature, but it’s Golden Config. And that, for us, was something we were going to have to do, right? That’s like our engine, right? We’ve got all our data, we’ve got our templates. Like, how do you pass that data through the templates and end up with a config? And that was kind of baked in. So for us, that was a win. And that’s where we started.
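The core idea behind a golden-config style workflow can be sketched as a toy: render an “intended” config from source-of-truth data, pull the actual device config, and diff the two for compliance. This is an illustration of the concept, not Nautobot’s implementation; the render function is a stand-in for a real Jinja2 template fed with Nautobot data.

```python
# Toy golden-config loop: intended config from data, actual config from
# the device, and a compliance diff between them.
import difflib

def render_intended(data: dict) -> str:
    """Stand-in for a Jinja2 template rendered from source-of-truth data."""
    lines = [f"hostname {data['name']}"]
    for vlan in sorted(data["vlans"]):
        lines.append(f"vlan {vlan}")
    return "\n".join(lines)

def compliance_diff(intended: str, actual: str) -> list:
    """Return only the added/removed config lines between intended and actual."""
    return [
        line
        for line in difflib.unified_diff(
            intended.splitlines(), actual.splitlines(), lineterm=""
        )
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

device = {"name": "leaf-01", "vlans": [10, 20]}          # stand-in SoT record
actual = "hostname leaf-01\nvlan 10\nvlan 30"            # stand-in device config
print(compliance_diff(render_intended(device), actual))
# -> ['-vlan 20', '+vlan 30']
```

An empty diff means the device is compliant with the source of truth; anything else is drift to remediate, which is the workflow Greg says came baked in.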

Eric Chou (21:11 – 23:00)

That’s awesome. Because I remember, a lot of times when I read Kirk Byers’ blog or, you know, other people’s blogs, it’s always like, hey, you know what the best part is? Somebody else already did it for you. It’s like, hey, you don’t have to live through it. You don’t have to burn the midnight oil. And I think that’s why a lot of people end up with open source tools as well. I mean, you could talk all day about the Linux kernel being open source, but how many times do you compile that kernel, and how many times do you, you know, work on that kernel, right? No. You take it as is. You trust that the open source community has enough eyeballs on it that it’s going to work. And then, you know, once that kernel is there, it’s like, okay, what now? Right? It’s all these little tools on top of Linux that make it useful to the system admin. I think you would agree with me that it’s not just the Linux kernel, but all these other tools that were previously established for Unix and have been ported to Linux. The pwd, the ls, the cd, and all these zip and tar, all these other small tools that are helpful. So I could see why you ended up choosing a platform that already has a lot of apps baked in. And you’re just like, hey, you know, why not? Let me just take it. But I also wanted to ask, you know, it seems like going from a vendor solution to an open source solution is quite a big jump. Maybe not from an engineering perspective, but from a management perspective, right? Like, who do I call when there’s an issue? What do I do when I want this feature to go into that software? Do I need to go hire people? So all of these are legitimate concerns. Did you go through that with your management? Or was that just kind of a culture that Intel already had, so there wasn’t a big issue there?

Greg Botts (23:00 – 24:06)

We did. And that’s a great question. And we’ve had a ton of support from our management, whether it’s technical management or business-side management. And one of the big pros, you know, we weren’t just getting random pieces of software from somewhere. We were getting one that did have the option of enterprise support if we wanted that. You know, we were between budget cycles at the time. But, hey, if this is going to look good and we want to scale this, there is that option out there. There is a company sponsoring it, right, that you’re probably familiar with. So having that as an option was another key piece of the criteria. The other thing you worry about with open source is maybe it’s going to die on the vine, right? Or maybe the contribution goes away. So one of the things that the DevOps guy we hired did, he had this idea when we were kind of shopping and comparing. He looked at contribution history, you know, over time, and looked at the repos and looked at the commits. And I thought that was a brilliant idea. And, you know, the one that we ended up going with was definitely on the ramp. So those two things kind of gave us the confidence, right, that it’s going to be there for the foreseeable future. And if we want or need, you know, that enterprise support, that’s a button we can push.
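That contribution-history check can be approximated in a few lines. Commit dates would really come from something like `git log --format=%as`; here they are passed in as strings, and the trend heuristic is an illustrative assumption.

```python
# Rough project-health check: bucket commit dates by quarter and see
# whether the later half of the history averages more commits than the
# earlier half. The heuristic is deliberately crude and illustrative.
from collections import Counter

def commits_per_quarter(dates):
    """Count commits per quarter from ISO dates like '2023-05-17'."""
    counts = Counter()
    for d in dates:
        year, month = int(d[:4]), int(d[5:7])
        counts[f"{year}-Q{(month - 1) // 3 + 1}"] += 1
    return dict(sorted(counts.items()))

def is_ramping(dates):
    """True if the later quarters average at least as many commits as earlier ones."""
    counts = list(commits_per_quarter(dates).values())
    if len(counts) < 2:
        return True
    mid = len(counts) // 2
    return sum(counts[mid:]) / (len(counts) - mid) >= sum(counts[:mid]) / mid

# Stand-in history, e.g. the output of: git log --format=%as
history = ["2023-01-10", "2023-02-02", "2023-07-15", "2023-08-01", "2023-09-20"]
print(commits_per_quarter(history))
# -> {'2023-Q1': 2, '2023-Q3': 3}
```

A real evaluation would also look at the number of distinct contributors and release cadence, but even this quarter-bucketing makes a “dying on the vine” project obvious.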

Eric Chou (24:06 – 25:32)

Yeah, it’s not so much an issue now, but previously, when, you know, that first book came out, people had that question about Python, right? Like, is it going to go away? Is it going to have enough features, or enough developers contributing to the features, that I could bank on it when I go to my manager and write thousands of lines of codebase based on it? And in this case, I think you don’t get a product for popularity, but you get it for the results that popularity brings, right? Like, that brings in attention, brings in money, brings in a lot of other things, you know, conferences where people are liking it, or like-minded people for bug fixes and so on. So in this case, I think Miss Congeniality really takes the number one crown there. You know, popularity really brings a lot of benefits, not just because it’s popular, but because of all the surrounding benefits for open source projects. So it makes sense that you picked one that fits that mold. And I think a lot of open source projects are successful by having a commercial vendor backing them. You know, you think of Red Hat, or you think of Elastic, you think of, you know, Kafka. So for all of these projects, yes, you could get far with open source, but at the same time, you could also get deep into it, and you have, you know, backup if you’re willing to spend some money to get the support that you need.

Ethan Banks (25:32 – 26:00)

Greg, what I’m curious about: you settled on Nautobot as a source of truth, and there’s a lot of other things that you can do with that tool. Okay, as you alluded to, Steinzi’s network automation landscape document is just massive with the tool explosion. How do you see that landscape evolving? I mean, Nautobot’s one piece of, I assume, a larger platform that you guys have built. Does that landscape consolidate over time? Are we just going to see even more tools? And what’s interesting to you guys as far as that landscape goes?

Greg Botts (26:00 – 26:05)

I should probably go look at it because since we’ve been shopping, it’s probably blown up even more, right?

Eric Chou (26:05 – 26:06)

Exponentially. Now it’s like page one of 10.

Greg Botts (26:07 – 26:43)

Right. Some things that I think, right, kind of like what we would be looking at as we go forward. One thing is around validation, network validation specifically, right? There is some stuff out there. There wasn’t a lot last time we looked, and it seemed to be maybe not getting as much contribution action at the time. But that validation piece, right? We have all our data. What if we could, you know, before we push a change, see exactly what is going to happen? And then after we push a change, you know, take a snapshot, right? A copy. Push your change, take another snapshot, and see the diff. Like, did you just break something upstream? You know, that sort of thing.
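The pre/post loop Greg describes can be sketched as: snapshot operational state before the change, snapshot again after, and diff the two. The state here is a plain dict (e.g. route counts per protocol) that would really come from the device APIs; the keys are illustrative.

```python
# Minimal pre/post change validation: compare two operational-state
# snapshots and report what changed. In practice each snapshot would be
# built from structured API output (route counts, peer states, etc.).

def snapshot_diff(before: dict, after: dict) -> dict:
    """Return keys whose values changed, mapped to (before, after) pairs."""
    changed = {}
    for key in before.keys() | after.keys():
        if before.get(key) != after.get(key):
            changed[key] = (before.get(key), after.get(key))
    return changed

# Hypothetical snapshots around a change window, echoing Ethan's
# 1,500-routes example.
pre = {"ospf_routes": 1500, "bgp_peers_up": 8}
post = {"ospf_routes": 1510, "bgp_peers_up": 8}
print(snapshot_diff(pre, post))  # -> {'ospf_routes': (1500, 1510)}
```

The value of the pattern is that an expected change (ten new routes) and an unexpected one (a peer going down) show up in the same diff, so the operator decides which deltas are acceptable.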

Ethan Banks (26:43 – 26:55)

Well, you’re talking about like testing libraries, these kind of things. I had 1,500 routes in this OSPF autonomous system and I still have 1,500 routes. Or I have 1,510 and that’s what I was expecting. That kind of stuff?

Greg Botts (26:55 – 27:04)

That sort of thing, yes. I just pushed a change down on this leaf, which was pretty innocuous. But did I just impact something, you know, on my service leaf?

Ethan Banks (27:04 – 27:08)

And now we can’t reach our data center in Brussels. What the heck?

Greg Botts (27:08 – 27:15)

Right. Exactly. That to me seemed like an area where I would love to see some growth on that map.

Ethan Banks (27:15 – 27:37)

Well, what are you looking for in that regard? I’ve heard a lot of people talk about testing. And most of the wisdom seems to be everybody’s environment is somewhat unique. And so you have to build your own library of tests and some of the tests you’re going to learn the hard way. Had you only known to test it that one time, you would have caught that thing. Well, now we know. Next time we’re going to test it.

Greg Botts (27:37 – 28:01)

And that’s fine too, right? We do know our environment. And even, you know, I’m talking about two different infrastructures that we’re dealing with here. And so they’re going to have different rules, right? This infrastructure, I don’t care about data point X, Y, Z. But over here, I really do care about it. So, you know, we would be totally, I would expect to develop some of those. It’s just, you know, all of the guts behind that. I want something off the shelf with abstraction. There was, you know, some stuff out there.

Eric Chou (28:01 – 29:21)

Yeah, like Batfish. Yeah, I’ve had Ratul on the show before. And, you know, he’s a professor at UW now, and I’m a UW grad, right? So all the power to him. But I would totally agree. I think the validation, and if you carry it forward enough to do a digital twin, where you could simulate enough of your network, at least the important parts. You don’t need the 5,000 devices, but you need the smallest deployment unit, your pod, to be able to test and have enough confidence, so that you go into that maintenance window knowing what changes you made. So the thing with Batfish or other validation tools was that they were able to use algorithms, so you don’t need to run it on the devices; you could just get the resulting output. But then at some point, you do want to run it through maybe your virtual devices. Then you run it through your physical devices in your lab, if you’re big enough that you can do that. So I think all these degrees of validation really depend on your size. And at the end of the day, my point is, I do agree with you. I think there’s a lot of space that could be filled, but it’s a big space. So where do we start?

Greg Botts (29:22 – 29:44)

And we’ve done exactly what you said, right? We, you know, have a little bit of emulation software. You know, I remember one really complicated change. We modeled it out in the VM space, worked like a champ, rolled it out. And there was, you know, a certain toggle on the hardware side that you just couldn’t emulate in the VMware side. And so, you know, there’s, yes, there’s a lot of holes that could be filled there.

Eric Chou (29:44 – 29:53)

Yeah, exactly. Like your microburst on the buffer between like your backplane, right? I mean, I’m saying it because I have the scar to prove it.

Greg Botts (29:54 – 29:55)

Yeah, same.

Eric Chou (29:55 – 30:24)

Yeah, you know, I think that’s a great observation. And only someone who has lived through all of that, Greg, could really point that out; then it makes sense to have that. But at the same time, I am glad that the community is thriving, and it really takes all of us to push any area that we see fit. And that’s part of the beauty of open source, right? Like, if you see something, you can go do something about it. So I don’t know if you have any additional thoughts on that.

Greg Botts (30:24 – 30:38)

And we did just that. We saw something that we needed, and we were able to act on it. And there's the community, right? Now you've got an entire development team. That's huge. There's a Slack channel, and folks are responsive.

Eric Chou (30:38 – 30:42)

May not always be polite or in the same time zone, but hey, the team is there.

Greg Botts (30:42 – 31:07)

It's okay. So we were able to influence it. There was a certain, I don't know if it was a bug, but there was a thing that we really needed, and it was like, yeah, here's what the problem is, it was pointed out to us. My DevOps guy got in the repo, we tested it out, found the solution. I think it was less than 10 lines of code, probably. Did the pull request, it got merged in, and it solved what was going to be a huge problem for us.

Ethan Banks (31:07 – 31:16)

And just for clarification, did you ask for this to be solved, or did you actually write that code? Your DevOps person wrote those 10 lines and got the PR submitted?

Greg Botts (31:16 – 32:05)

He did. We got guidance, right? We were like, hey, this is happening, because I don't know if anyone's seen this. There's a great model for a chassis device that has a bunch of modules. In our D environment, a good number of the SKUs out there are like that, so we can be flexible. This row needs four 10-gig line cards in that chassis, plus 200-gig line cards, but in the next row over, it's vice versa. And we didn't want to model all those possibilities. So there was this great model of, hey, your device can have modules; it mimics just swapping out your line card in a slot. That was working fantastic. Then we were trying to set up some peering relationships onto those module interfaces, and that part just hadn't been tried; it wasn't tested, apparently. We needed that to work, and the Slack channel was like, hey, this is right where that's happening, here's the URL to the repo.

Ethan Banks (32:05 – 32:08)

So it was a combo of the community built around it. We're talking about Nautobot in this case, right?

Greg Botts (32:08)

Yes.

Ethan Banks (32:09 – 32:50)

So it's a combo of that community and then someone on your team, the two and a half people, who was able to do that. That was another thing I wanted to clarify. I mean, Intel, the bigger company, has been involved in networking, shipping networking products, and sitting on the bleeding edge of networking for a lot of years. But it's not like you had to tap into some deep inner Intel resource that the rest of us don't have to get done what you needed to get done. It was just someone who knows a bit of coding and was able to get it done with some help from the community.

Greg Botts (32:50)

Exactly.

Eric Chou (32:51 – 34:01)

I think that's what drew me to the open source community in the beginning, right? Because now that you've made the contribution, the next person who faces that same issue can either use what you already contributed or build on top of it. And now everybody benefits. So I think it's great that you and your team have done that. And the two and a half guys, like the sitcom, right? It's just standing on each other's shoulders, this collaboration across different continents, across different companies. And I'd be the first to tell you that a lot of these open source things are built initially for one company specifically, and then the company was gracious enough to open source it, and now everybody gets to use it. So it's great that you've done that, and I look forward to more of that from you and from other companies, so we can all benefit from each other. And I guess that's the model for Ubuntu, right? People holding hands in a circle, singing Kumbaya.

Greg Botts (34:02 – 34:05)

Yes. We’ll join that circle. Yes.

Eric Chou (34:05 – 34:52)

Yeah, exactly. So walking back through our previous conversation and your experience: you started with your company's needs, you went shopping, you clarified the features, and you picked a tool, maybe multiple tools that you mentioned or didn't mention. You ran through the tests, checked the features, and made sure they fit your needs. And now, what were you able to show for it? At the end of the day, I saw some amazing stats in that paper on what came out of that journey, not a short one. Any specific measurable outcomes, like incident reduction or device growth, that you gained from going through all this pain?

Greg Botts (34:52 – 36:31)

The one that jumps to mind is our provisioning speed. Previously, it was a bit more dynamic: the device starts figuring out who its LLDP neighbors are, and that's how we set our peering and things like that. It's a tough one to measure, because our network engineers are doing 10 different things at once while they're trying to provision something quickly. But we did some analysis and figured it was taking about eight hours to provision stuff. Now it's down to, we conservatively said, two hours. What enabled that was the ability to pre-build all of that. Now they can upload a minimal amount of data into our network source of truth, and it's enough to spit out the entire config. So when that thing boots up, the ZTP process isn't just giving it the bare minimum to get on the network. If they've done that step ahead of time, it comes up on the network, it's peered, and it's ready for client connections. That workflow was an efficiency gain on something we do every day; that's how we're keeping up with that growth. The other thing we've seen, and we put this in the paper, was incident reduction. If you go back to 2020, we had maybe 12 of what we would classify as major incidents across the landscape. Then, going forward in the D space, we had four years in a row with zero. Some of that was a new platform, and we weren't dealing with capacity issues. But a big piece of it, in my opinion, was the config standardization. We didn't have the one-offs, and it's the one-offs that are the landmines. We didn't have as many of those, so I think that contributed to some of those numbers.
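
The pre-build idea Greg describes can be sketched in a few lines: a small record in the source of truth expands into a complete config before the box ever boots, so ZTP hands it a finished config rather than a bootstrap. This is a minimal illustration, not Intel's actual tooling; the field names, template structure, and CLI syntax are all hypothetical.

```python
def render_config(device: dict) -> str:
    """Expand a minimal source-of-truth record into a full device config."""
    lines = [
        f"hostname {device['name']}",
        "interface Loopback0",
        f" ip address {device['loopback']}/32",
        # The peering is derived from the record up front, instead of being
        # discovered from LLDP neighbors after the device comes online.
        f"router bgp {device['asn']}",
    ]
    for peer in device["uplinks"]:
        lines.append(f" neighbor {peer['ip']} remote-as {peer['asn']}")
    return "\n".join(lines)

# A handful of fields is the only input an engineer has to supply.
leaf = {
    "name": "leaf-101",
    "loopback": "10.0.0.101",
    "asn": 65101,
    "uplinks": [
        {"ip": "10.1.1.0", "asn": 65000},
        {"ip": "10.1.2.0", "asn": 65000},
    ],
}

print(render_config(leaf))
```

In practice this rendering step would be a Jinja template driven by the source of truth, but the principle is the same: minimal data in, entire config out.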

Eric Chou (36:32 – 37:18)

Yeah, they kind of impact each other, don't they? Because of the repeatability, you change your mindset about what you can do. Because it's so easy to bring up new devices, maybe the solution today is actually to throw more money at it, right? Because of business needs, sometimes the right direction is just more capacity, meaning more devices in your scalable leaf-spine design, which, if I read it correctly, was five stages, right? That's actually a bit more than most enterprises today, but it's a very scalable design for now and for the future. So you have multiple options, as opposed to the traditional approach where you always have to scale up instead of scaling out.

Greg Botts (37:19 – 38:08)

I would totally agree with that. In the old core-distribution-access design, we were constantly chasing layer 2 issues and constantly fighting congestion. We architected the leaf-spine for the peaks. So now we're not spending our time dealing with congestion, and we're not chasing the layer 2 issues, because it's all plumbed out with layer 3. And then we turned around, and it's like, oh, now you have this giant underlay, especially in the D, that five-stage one you're talking about. It's a huge underlay. So now, when we do get those external customers, oh, we need to isolate that workload? Well, I can do that with an overlay now. I know my underlay is all standardized and has plenty of capacity, so I can just throw an overlay on top of it. We were able to deliver solutions pretty quickly that way.

Ethan Banks (38:09 – 38:10)

You're running a five-stage Clos fabric?

Eric Chou (38:11 – 38:14)

You said five stage, right? Yeah. I saw that. I highlighted it.

Greg Botts (38:14 – 38:15)

One of our big ones, yeah.

Ethan Banks (38:17 – 38:45)

Yeah. I mean, to me, that's always been the trick of scale-out. If you want to succeed at that, you've got to keep everything homogeneous. It's all got to be same-same. As soon as you have one-offs and corner cases and little weird things, oh, we have this one exception? No. You just have to say no. You don't want to be the guy who says no, but you have to, or else the whole thing falls apart. And do you know how many access ports you've got in that fabric?

Greg Botts (38:46 – 38:48)

Oh, I should have that number, but I don’t.

Ethan Banks (38:48 – 38:53)

It’s got to be, yeah, it’s got to be many, many thousands, tens of thousands, I suppose.

Greg Botts (38:53 – 38:55)

My network source of truth knows.

Ethan Banks (38:55 – 38:55)

Yeah, yeah.

Eric Chou (38:56 – 38:57)

There you go.

Greg Botts (38:59 – 39:01)

You can do that with an API. Yeah.
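
The "my source of truth knows" lookup can be sketched as a trivial query against the inventory data. The payload below is a made-up stand-in for what an interface listing from a source-of-truth API might return; the `role` field and its values are hypothetical, not Nautobot's actual interface schema, and a live query would page through the REST or GraphQL API with an auth token.

```python
def count_access_ports(interfaces: list[dict]) -> int:
    """Count interfaces marked as access-facing, ignoring fabric uplinks."""
    return sum(1 for intf in interfaces if intf.get("role") == "access")

# A toy three-interface payload standing in for a real API response.
sample = [
    {"name": "Ethernet1", "role": "access"},
    {"name": "Ethernet2", "role": "access"},
    {"name": "Ethernet49", "role": "uplink"},
]

print(count_access_ports(sample))  # prints 2
```

The point is less the code than the workflow: Ethan's question becomes one query instead of an audit.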

Eric Chou (39:01 – 39:25)

API or GPT, right? Nautobot just came out with Nautobot GPT. It's in beta, but hopefully one day we can use natural language so anybody can just ask, hey, how many access ports, how many distribution ports do we have out there? Which brings me to the next question: AI. We have to mention AI, because that's the day and age we live in.

Ethan Banks (39:25 – 39:28)

You don’t have to, Eric. You just did.

Eric Chou (39:28 – 39:30)

I resisted, but I resisted long enough.

Greg Botts (39:30 – 39:32)

You can’t have a podcast without it.

Eric Chou (39:32 – 40:14)

Yeah, exactly. The AI police isn't going to come and get me. But you did mention the huge underlay, and something we didn't even touch on: the business flexibility, right? Your agility to integrate new features and new requirements relatively easily without redesigning your architecture. So now we have AI, which I imagine puts a lot of stress on your infrastructure, or did it? What is it about the automation you built, the infrastructure you have underneath, that enables the new possibilities we generally label AI? The training, the usage.

Greg Botts (40:15 – 41:33)

So step one was the data, right? We've basically set the table for an AI feast. We have the data, so step one, check. Now we've got this combination of abstraction in our system, and if you throw some AI tools at it, maybe we don't have to go to the unicorn farm and find more unicorns. For our network engineers, the onboarding ramp to participate in the automation system and develop the next thing we need to add just got less steep. It's very approachable now for our network engineers to uplevel. It's no longer, hey, you need to become an expert Python programmer so you can contribute to our solution, which is a hundred thousand lines of code and a database with some tables. Now it's, hey, look at how our data is modeled, figure that out, figure out if we need a new data component, figure out how that gets modeled. That was a fun thing to learn when we did this; data modeling is a different skill set. And then do some Jinja, and things like that are very, very approachable now. And like you said, there's the possibility now to interact with our data in a very simple way. Write me a report. Tell me how many access switches I have in my data center, so Ethan can have his question answered. Write me a job to go do this thing, that sort of thing.

Ethan Banks (41:33 – 42:04)

You're using AI in a couple of different ways, then, in your mind. One is to mine the data you've got to find interesting facts about the network, things that are useful. It could be the number of unused access-facing ports, the number of VLANs, statistics, reporting, these sorts of things. But you're also saying you want to use AI to help you with automation: I need to develop this new feature or a new template or whatever it is, help me generate that stuff.

Greg Botts (42:04 – 42:51)

Exactly. We have a list. In fact, we keep a backlog on our little two and a half person team of things that our team needs, and most of them, in Nautobot parlance, are jobs. And Nautobot's done a great job of stubbing those out, so you don't have to write all that Python from scratch. But it's still a Python program, a script you're running to go do the thing: interact with the data, maybe interact with our devices, do some logic, whatever. If we can throw some AI at that... We haven't yet. We just finished this. In fact, don't tell our managers yet, but I haven't quite finished the migration. We're down to the last couple hundred devices, so we're almost done. That's what we're hoping to do next.

Eric Chou (42:51 – 43:42)

That's a great point. You're actually doing two parts with AI, right? You're building the infrastructure to enable AI, but on the other side, you're also a user and consumer of these AI products after they're trained, specialized, tuned, or whatever. It's on both ends. So it's a very interesting aspect, and I think it leads logically to the next question I have for you, Greg. Just in our short conversation, we've gone through so many iterations: the skills, the tools, everything you've been through, all the way up to using AI, on jobs, for example. So if you were to start your journey over again today, knowing what you know now, and this is actually my favorite question, what would you do differently, if anything?

Greg Botts (43:43 – 44:56)

For this journey and for our situation, if I could go back to that phase one, back to that BC time, I would 100% have all of our config automated. When we started out in phase one, there was maybe 80% of it automated, a good chunk for sure, more than we had done in the past. But it was very easy to add on that other 20% for customization, and you'd be surprised how many corner cases, how many one-offs, you can end up with in an environment this big. With phase two, we are now rendering 100% of that config. What the migration looked like was, hey, bring in all the data from your current network devices. Our unicorn wrote these fantastic Nautobot jobs to bring all the data into Nautobot and render the new config, and then we compared it. A DevOps guy wrote a tool that says, here's the rendered config, here's the running config, what are the differences? And I can push that out to the box for you. I thought we were going to be very clean, that there shouldn't be hardly any drift. And sure enough, there was more than I thought. Like we just talked about, those one-offs are the landmines that blow up at some point. So if I could go back, I would do a hundred percent.
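
The drift check Greg describes, rendered config versus running config, can be sketched with nothing more than a text diff. Python's standard `difflib` stands in here for whatever diff engine the real tool used, and the two configs are toy examples, not Intel's.

```python
import difflib

# Config rendered from the source of truth (the intended state).
rendered = """hostname leaf-101
interface Ethernet1
 description server-rack-7
 mtu 9214
""".splitlines()

# Config pulled from the live device (the actual state).
running = """hostname leaf-101
interface Ethernet1
 description TEMP-do-not-touch
 mtu 9214
""".splitlines()

# Any '-'/'+' lines in the unified diff are drift: a one-off someone
# hand-typed on the box that the source of truth knows nothing about.
drift = list(difflib.unified_diff(rendered, running,
                                  fromfile="rendered", tofile="running",
                                  lineterm=""))
for line in drift:
    print(line)
```

An empty diff means the device matches intent; anything else is a candidate to either push the rendered config over, or fold the one-off back into the data model.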

Eric Chou (44:56 – 45:54)

Yeah, but the point is, now you know the road ahead of you, as opposed to previously, when maybe you were in the dark until something went wrong and bit you in the butt during a maintenance window, or worse. So these lessons were not wasted; they're hard-fought battles that gave you this confidence, and now you know what to tackle. So I think we're moving toward the end of our podcast here, and it feels like we could just go on forever, but I do want to ask: looking forward, where do you see the biggest opportunity for network automation, especially for an environment like Intel's? After years of automation, years of process and orchestration lessons learned, where do you think the biggest opportunity for network automation is?

Greg Botts (45:55 – 46:20)

So in my space, the D and E out of DOME, the validation that we talked about would still really be the next game changer to me. I would also like to tackle taking streaming telemetry from all those 5,500 devices. That's more data for an AI engine, and then you can start looking at self-healing kinds of activities, right? Go agentic with your AI.

Ethan Banks (46:20 – 46:34)

Do you have a scheme for that? Because you're talking about streaming telemetry from 5,500 devices. That is an enormous amount of data. Do you have a specific thing in mind, like the kind of telemetry you're looking for, as opposed to just turning it all on and figuring it out?

Greg Botts (46:34 – 46:40)

I don't know yet. I'm going to go to Steinzi's map, see what's out there, see how much it can handle, and then go from there.

Eric Chou (46:40 – 46:42)

Yeah. Is that before or after your crash?

Ethan Banks (46:43 – 47:23)

But to speak to your point, it sounds like you want to surface maybe gray failures. There's an optic that's going bad in the data center: how do you detect that? When you have that many devices, it can be a challenge, but AI is really good at finding those oddities in the data that might go unnoticed because you're still passing traffic. All of a sudden you've got some number of discards or retransmissions or something happening that you could pick up on from that streaming telemetry, perhaps. And then AI can put the pieces together and go, there's something wrong, and it's in this part, this tier. I think it's this switch. You should look at this and see what you can find. In that exact voice, too.

Greg Botts (47:24 – 48:05)

I think in general, to summarize the answer to that question, we've now democratized our network data. So I'm really looking at how we can increase efficiency across Intel IT. Now maybe our cabling techs can work off the same data, instead of us passing spreadsheets around, generating spreadsheets and making mistakes. The server guys racking servers can look at the same rack our network switch is in, and we're all working off the same data. That can lead to self-service applications, and so on. I think that's the other big opportunity we're looking for.

Eric Chou (48:05 – 49:29)

Yeah, I think observability and telemetry have always been a huge headache for networking. Even in previous lives, any vendor that came in would swear and pound the table that it's not going to break. We've seen the things they don't say: put it under stress, put it under scale, and it's just going to break. I know in the paper you mentioned sFlow, right? As opposed to NetFlow, sFlow does sampling, and you can decide how aggressive to be with it, and all that. And we have the big databases, schemaless, relational, non-relational, out there. But there's still this huge gap I see for deep inspection, the deep telemetry data, and just like you, I hope we'll find solutions one day. Me, I'm biased toward open source, biased toward all these solutions out there, and I think somehow we could put our brains together, all the brilliant minds, all the DevOps folks, all the unicorns, and come up with something that works for everybody. But I don't know, it's been years. I have that hope, and I still don't see any promising projects. What do you have in mind? Or are we just kind of in the same boat there?

Greg Botts (49:29 – 49:58)

We're kind of in the same boat. I mean, what's encouraging to me now is that there are conferences. There's an AutoCon; that's a thing now, and it wasn't a thing five years ago. And it's growing. Just the ecosystem, the fact that there's a map out there that Steinzi maintains and that it's growing, is encouraging. It deserves a deeper look. We haven't been able to get to it; the two and a half of us were saturated just getting to this spot. But I'm encouraged now more than ever in this area.

Eric Chou (49:58 – 50:16)

I agree. I agree. So before we wrap up, is there any last call to action? If somebody listening to this podcast wanted to know what the first step is, is there any call to action from your end, or anything else you want to cover?

Greg Botts (50:16 – 50:19)

I think first thing is they should listen to your podcast.

Eric Chou (50:19 – 50:21)

I appreciate that. Thank you.

Greg Botts (50:21 – 50:51)

I would honestly like to thank both of you guys for the work that you're doing, going out and evangelizing, telling these stories. That's one of the reasons that when we went shopping, we were able to navigate that landscape: you hear what other folks are doing, you realize there's a map out there, go look at it, that sort of thing. Your podcasts are also enhancing the RTO experience at our site. There's your tagline: enhancing RTO.

Eric Chou (50:51 – 51:17)

I'm going to crop that and use it as the promo for this episode. And I want to echo that. I'm the newcomer here, but Ethan's been doing this for 10-plus years. I feel like it's 10-plus years; I don't know exactly, but it's pretty close, right, Ethan?

Ethan Banks (51:07 – 51:08)

It’s 15. Yeah. I started podcasting in 2010. Yeah.

Eric Chou (51:09 – 51:37)

We can end this on a positive note, right? Yes. There is a community, there is an ecosystem, and it's growing. If I remember correctly, it was 600 people at AutoCon 3. AutoCon 2 and 3 are probably a little different because they were in Europe, right? So let's just compare with AutoCon 4: we anticipate over a thousand there. Ethan's going to be there. I'm going to be there. Greg, I hope you're there as well.

Greg Botts (51:38 – 51:39)

Hope so. Trying.

Eric Chou (51:39 – 51:47)

Thank you, Greg and Intel for sharing your story. Thank you, Ethan, for joining us today. I couldn’t ask for a better co-host.

Ethan Banks (51:43 – 51:47)

Thanks for inviting me, Eric. I enjoyed it as always.

Greg Botts (51:47 – 51:52)

Well, thank you very much. I enjoyed talking to you both. Thanks.

Eric Chou (51:52 – 52:12)

And thanks to Network To Code for sponsoring today's episode. Don't forget to check out their solutions at networktocode.com. Do you have any feedback for Network Automation Nerds, our great guest today, or this episode? Please send us your follow-ups at packetpushers.net/follow-up. We do want to hear from you. Last but not least, remember that too much network automation will never be enough.
