15 Apr 2017, 14:50

Cloud Providers as Lean Suppliers

“Move toward a single supplier for any one item, on a long-term relationship of loyalty and trust.” - Deming

One of the tenets of Lean Manufacturing is that any optimization focused only on your own operation is the same as an optimization focused on one part of the factory floor - “an illusion.” It’s a local optimization and doesn’t look at the system as a whole, so you may not actually improve anything. There’s a pragmatic approach to it - sometimes you focus on your sphere of control instead of just your sphere of influence, or you wait till “the right time” - but the point remains. You have to look at your supplier (or consumer) and see where optimizations there can make your delivery more optimal. This is so prevalent that there are even books on the supply chain of Toyota, the Lean poster child.

Now, being able to have that kind of influence on your supplier is a tricky prospect. The simple thought is that it requires a relationship where you as the customer have so much (maybe undue) influence over the supplier that the supplier is coerced into doing what you want, or that the supplier is so customer focused that they’ll do anything you ask. The simple thought with regards to Toyota falls into the former category. The not-so-simple thought is that this interaction hinges on a deep (not just lengthy, but deep in the degree of interaction) and mutually beneficial relationship. The key to a relationship is that both sides work on it and work a lot of give-and-take into it.

This is quite the opposite of a push toward multiple suppliers that are all interchangeable. There, establishing the relationship is secondary to being able to switch away from a supplier at any time for any reason.

The truth is that nothing is black and white, and different situations are at a different level of gray. One company may work very hard to be supplier-agnostic, while another may work very hard to establish deep relationships.

The same plays out in the IT arena - in multiple ways and at multiple levels. You have long-standing, deep relationships with existing vendors, and you have senses of “this vendor is interchangeable.” Interchangeable is always iffy, as it depends on the commoditization of both the technology and the process that uses it, but it is a goal that is longed for. This works down even to the code level, where one interface may have multiple implementations. Whether that code-level logic is the source of the same logic showing up in vendor management is anyone’s guess.

Today it’s an open question how deep an IT group gets with a Cloud Provider. It’s not easy to say that it should never be very deep. Even if you maintain a hard line of only relying on features that have parity across multiple vendors, you still have to make sure that the interface surface for managing those features is on par. With many types of lock-in, we tend to underestimate the size of that interface surface (even to the point of ignoring many parts of it).

“But how do I maintain a relationship where I’m tied into them heavily, but I’m only a small part of their revenue?” Well, there’s no simple answer. The Cloud Vendors are some of the largest companies in the tech industry, and it’s hard to hold any cards over them. But they got where they are by having a huge focus on the customer (Amazon even claims to be the most customer-centric company on earth), so you can assume they’ll put a bit of work into connecting with you.

The best advice I can offer is “be pragmatic and mindful, and make sure you’re communicating.” There are going to be issues maintaining that relationship, and there are going to be risks in having that relationship. Don’t go into it expecting a silver bullet or that it will never be rocky. There will be bumps on the way, and there will be back-and-forths. The key is to know that you have to work at it, and to make sure each side knows what that work is. And realize that sometimes it’s just necessary to lift a lot yourself to be able to move over to a new relationship.

20 Feb 2017, 12:37

ANCL: connecting multiple models

One use case not previously documented is that of connecting two models together. This can be used to cover some combinations of the earlier use cases: the same model applied to a node in different ways, a node in multiple models, and the self-referential case. In our previous models, we fully connected everything or made it orthogonal (“node in multiple models”). The key difference here is that none of the previous ones fundamentally change the model world - it’s a matter of making sure that a single model at a time is applied to the nodes.

This is a matter of connecting two models in a way that could be seen as making a new unified model.

Note: below, I’m going to be using two new notations: a. There’s an added yaml dictionary level for the model name. b. Some of the model details (specifically: target ingresses) may be purposefully left out.

The simple example of an app talking to Cassandra:

appmodel:
  client:
    egress: [[app,appapi]]
    ingress: {}
  app:
    egress: [[db,binary]]
    ingress:
      appapi: [8009,8009,"tcp"]
  db:
    egress: []
    ingress:
      binary: [9042,9042,"tcp"]
cassandra:
  client:
    egress: [[local-server,binary]]
    ingress: {}
  local-server:
    egress:
    - [local-server,plain-gossip]
    - [remote-server,encrypted-gossip]
    ingress:
      binary: [9042,9042,"tcp"]
      plain-gossip: [7000,7000,"tcp"]
      encrypted-gossip: [7001,7001,"tcp"]
  remote-server:
    egress:
    - [local-server,encrypted-gossip]
    ingress:
      encrypted-gossip: [7001,7001,"tcp"]

(figure: appmodel and cassandra)

One way to handle this is to not. Basically, avoid some of the overlaps here - e.g. lose the db role inside of the appmodel. Make it look like:

(figure: no overlap)

Then you could say for role assignments:

appnode01: [appmodel::app,cassandra::client]
cassandranode01: [cassandra::local-server]

While this works, it doesn’t seem like a good idea. Without the additional role assignment context, you don’t know that the app node requires DB access. The model really is incomplete.

Another approach is to be able to link the two models - to say that nodes in one model are equivalent to nodes in another model.

(figure: linked)

There’s a couple of ways to do this.

  1. Connect the edge: The overlap of edges is to recognize that appmodel::app----binary---->appmodel::db::binary is the same as cassandra::client----binary---->cassandra::local-server::binary
  2. Connect the node pair: Similar but just say appmodel::app == cassandra::client and appmodel::db == cassandra::local-server

The first might be a more accurate assessment of what it actually is, but it is more verbose and I’m not sure it achieves any real difference. So, just going with the second, a linkage might look like:

[appmodel,cassandra]:
- [app,client]
- [db,local-server]

This doesn’t look like good yaml. I’m not sure every parser can handle a sequence as a dictionary key, so this might change. The key is to be able to discern the tuple.
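
If it does have to change, a hedged alternative that stays within plain yaml might be something like the following - the links/models/roles key names are placeholders I just made up, not settled syntax:

links:
- models: [appmodel,cassandra]
  roles:
  - [app,client]
  - [db,local-server]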

Implementation

The interesting part is that this can be implemented without fundamentally changing anything previous. The only piece to add is to extend the roles automatically - wherever there’s a linkage, add the linkage role equivalent to the node’s roles.

appnode01: [appmodel::app]
cassandra01: [cassandra::local-server]

becomes

appnode01: [appmodel::app,cassandra::client]
cassandra01: [cassandra::local-server,appmodel::db]

Inferred model roles

Above, the db::binary and local-server::binary are redundant: the port information is defined twice. In reality, the appmodel is more concerned about the app itself and less concerned about the specifics of the DB; it just cares that there is one. If you know there is going to be a linkage, you could consider stubbing out the role that is going to be the crux of the linkage. This might look like:

appmodel:
  db:
    egress: []
    ingress:
      binary: false

From here, it says “there’s a db role, but it needs a linkage to make it concrete.”

Given that the stub isn’t really saying much that isn’t already implied by the egress on the “app” role, it’s possible to not even define the db role - not even having a “db” dictionary key - and to infer that there’s a “db” role. Similarly, since the “client” role only has the egress, which is referenced elsewhere, it’s possible to infer that as well.

From an automation aspect, this means that the role expansion could be limited. However, since both sides are likely to have something stubbed out or inferred, it would be easy for this to fall into the trap of not expanding either side, and then we’re back to step one. So, one side has to remain. Since the realizations are going to happen from the target side, I think it makes sense to keep the source side roles. This means the above roles would look like:

appnode01: [appmodel::app,cassandra::client]
cassandranode01: [cassandra::local-server]

The added bonus of the inferred roles is that they can be automatically replaced when processing. This might make realizing the configuration output a little bit easier - e.g. instead of ending up with many incoming policies for the cassandra::local-server connection, you end up with one (since the inferred target roles are not expanded) with multiple sources (those roles are expanded) for each linked context/model.
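
As a rough sketch of that single realized policy - the output format here and the second linked model (“otherapp”) are invented purely for illustration:

cassandra::local-server:
  ingress:
    binary: [9042,9042,"tcp"]
    sources:
    - appmodel::app
    - otherapp::app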

Self-referential update

This can be used to connect two cassandra data centers together (dc1,dc2). The linkage:

[dc1::cassandra,dc2::cassandra]:
- [local-server,remote-server]
- [remote-server,local-server]

This is basically saying that in the dc1::cassandra context, dc2::cassandra::local-server is a remote-server, and vice-versa. So node assignments go from:

cassandra-dc01: [dc1::cassandra::local-server]
cassandra-dc02: [dc2::cassandra::local-server]

to:

cassandra-dc01: [dc1::cassandra::local-server,dc2::cassandra::remote-server]
cassandra-dc02: [dc2::cassandra::local-server,dc1::cassandra::remote-server]

This seems to work since the linkage doesn’t create any redundant roles in this use case. It may fall down when there are redundant roles, but I’m not seeing how that would come up just yet.

Not done yet?

The downside of this is that since no new actual model is defined, contexts aren’t automatically handled. So, the above works for what it is, but the linkage can’t necessarily be promoted. It could be, but that might be overdoing it. This has some impact on the linkage of models vs the linkage of context+models.

So, I have to go model a bunch of stuff at work and see if this can express in a simple way everything that I need there.

18 Feb 2017, 15:03

ANCL: Use Cases

I had a chance recently to revisit ANCL in two ways. I recently had to compare some firewall rules between two different firewalls that were set up to mirror each other. The original firewall was not set up using any modeling of firewall rules, so it very much fell under the issues that I originally commented on. Lessons learned:

  1. It’s much easier to collect/reason about the communication pattern when you look at it from the “what do I need to talk to” perspective instead of the “what talks to me” perspective. People are blocked when their downstreams aren’t working and so have a bit more motivation to make sure those are well described.
  2. Even JSON is a bit more verbose than what I wanted to deal with when working on the rules en masse.
  3. To make it happen, I simplified and didn’t attempt to build out any kind of hierarchy or dependency. Even though some would stem from the same model, I manually created the instantiation of those models. I don’t have a good approach for this yet, but working through the concrete example gave me a better understanding.
  4. It’s easy to use IP addresses in the context of firewalls, and you can overlap anything that has that address or a containing CIDR.
  5. Naming is hard (1): There seems to be a bit of redundancy with model roles. If I have a service, what do I specify for the clients, and what do I specify for the port which the clients connect to? In both cases, I want to use “client”.
  6. Naming is hard (2): It’s still not clear to me what to use to describe the generic descriptions (e.g. models), the components in those descriptions (e.g. roles), and the instances of items in those roles (e.g. nodes?). I keep using the term roles in the place of the nodes - I think.

Separately from the firewall, I’ve been looking at using this to help figure out overall communication matters. I’m trying to bridge together different applications run by different groups and using different interconnect mechanisms. I need to get quality and slot information for what’s talking to what. That’s got me thinking. A few more items to postulate:

  1. Mental exercise: How does routing information play into all of this? Does different routing affect how the models are structured?
  2. Mental exercise: Can I use the models to influence aggregation and reporting on netflow data? Each netflow entry could be associated with a specific model, which gives a lot more context than the protocol, port, and subnet that end up being the bulk of what I usually see.
  3. Mental exercise: What would it look like to add additional information to each model? Not just “443/tcp” but also “100Mbps”, “TLS”, and “client certificate required”? (See the sketch after this list.)
  4. Mental exercise: In the first model, I associated roles to specific IPs. What would it look like if, instead of IPs, I used AWS instance IDs, or AWS security group IDs, or container IDs, processes, etc.?
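
For the third exercise, a hedged sketch of what extra attributes on an ingress might look like - the ports/bandwidth/tls/client-cert keys are hypothetical, not something the models define today:

web:
  ingress:
    webapi:
      ports: [443,443,"tcp"]
      bandwidth: "100Mbps"
      tls: true
      client-cert: required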

So, there’s a lot more interesting stuff beyond just the firewall, and it’ll be interesting to see what comes up. But I still worry about the complexity, so I want to figure out ways to reduce that complexity.

The first one is to not have specific models (“this application’s Oracle DB”) for everything and to be able to use more generic models (“Oracle DB communication”). This means having an ability to reference a model. I’m still not sure how to do that. So, I’m trying to take a step back and come up with some use cases to help noodle through this. With that in mind, the remainder of this is about examining them. I’m not committing to anything, so you’ll possibly see a few implementations below.

Use Cases

Simple 3 Tier Application

This is your classic three tier application.

Client->Web->App->DB

A sample general model could look like:

client:
  egress:
  - [web,webapi]
  ingress: {}
web:
  egress:
  - [app,appapi]
  ingress:
    webapi: [443,443,"tcp"]
app:
  egress:
  - [db,sqlnet]
  ingress:
    appapi: [8009,8009,"tcp"]
db:
  egress: []
  ingress:
    sqlnet: [1521,1521,"tcp"]

Shared DB

This is the case of the same model being applied in two different contexts with one overlapping resource. The example is a shared DB resource (here shared between dev and prod, but it could just as easily be shared more widely).

(figure: prod/dev share db)

A fully expanded model could look like:

dev-app:
  egress:
  - [db,sqlnet]
  ingress: {}
prod-app:
  egress:
  - [db,sqlnet]
  ingress: {}
db:
  egress: []
  ingress:
    sqlnet: [1521,1521,"tcp"]

However, in reality, there’s a base model which looks like just:

app:
  egress:
  - [db,sqlnet]
  ingress: {}
db:
  egress: []
  ingress:
    sqlnet: [1521,1521,"tcp"]

The question is really about how to relate multiples together. Looking at roles:

prod-app: ["prod::app"]
dev-app: ["dev::app"]
db: ["prod::db","dev::db"]

This works in this simple example, but I’m not sure it covers everything (see below).

Same model applied to node as two different roles

This is one that masks quite a bit so it’s not clear what the perfect setup is. The simple case is that there’s a DB that serves sqlnet, but in turn also connects to other DBs using sqlnet (e.g. replication).

main-db -> ro1-db -> ro2-db

This could look like:

db-client:
  egress:
  - [db-server,sqlnet]
  ingress: {}
db-server:
  egress: []
  ingress:
    sqlnet: [1521,1521,"tcp"]

main-db: [main2ro1::db-server]
ro1-db: [main2ro1::db-client,ro12ro2::db-server]
ro2-db: [ro12ro2::db-client]

The “db-server” and “db-client” part feels a bit weird. I kinda want to just have “server” and “client” but then feel like I need another name hierarchy - e.g. “db::client” and “db::server” - so the roles would look like:

main-db: [main2ro1::db::server]
ro1-db: [main2ro1::db::client,ro12ro2::db::server]
ro2-db: [ro12ro2::db::client]

This looks ok, but there are two concerns for me:

  1. How many things are “client” or “server”? Would there be a way to simplify that?
  2. Having to have a context for all of the directed pairings seems a bit overdone. Is there a way to simplify that?

The latter concerns me more. Maybe don’t use the pairwise contexts, and let the context hang a bit more off the node (in this case) itself:

main-db: [main::db::server]
ro1-db: [main::db::client, ro1::db::server]
ro2-db: [ro1::db::client]

Node as Multiple models

This is the case of having a node participate in multiple models. The example is that the node is part of its main role (app or db), but it’s also being monitored and logged into (so, “adminee” controlled by “adminbox”).

(insert app/db models above)
adminee:
  ingress:
    ssh: [22,22,"tcp"]
    snmp: [161,161,"udp"]
  egress: []
adminbox:
  ingress: {}
  egress:
    - [adminee,ssh]
    - [adminee,snmp]

With an example of multiple roles put together:

prod-app: ["prod::app","adminee"]

Overlapping attributes

This is more of “if there are overlapping attributes, a node needs to pick up the roles of every attribute that matches it.” A simple example of overlapping IP addresses/CIDRs:

"192.168.1.50/32": [adminbox]
"192.168.1.0/24": [adminee]

In this case, 192.168.1.50/32 has both [adminbox,adminee]

Self Referential

Some models are a bit self-referential. Nodes of the same role will talk to each other (cluster members). Nodes of the corresponding role (cluster members in different subsections of the cluster) will talk to each other in another way. The poster child for this is Cassandra:

(figure: Cassandra Fun)

So, a model might look like:

client:
  egress:
  - [local-server,binary]
  ingress: {}
local-server:
  egress:
  - [local-server,plain-gossip]
  - [remote-server,encrypted-gossip]
  ingress:
    binary: [9042,9042,"tcp"]
    plain-gossip: [7000,7000,"tcp"]
    encrypted-gossip: [7001,7001,"tcp"]
remote-server:
  egress:
  - [local-server,encrypted-gossip]
  ingress:
    encrypted-gossip: [7001,7001,"tcp"]

And the roles might look like:

app-dc1: [dc1::cassandra::client]
app-dc2: [dc2::cassandra::client]
cass-dc1: [dc1::cassandra::local-server,dc2::cassandra::remote-server]
cass-dc2: [dc2::cassandra::local-server,dc1::cassandra::remote-server]

I’m actually surprised by this model. It seems to be one of the cleanest but it’s also pretty complex. Feels like a trap but I’m not seeing it yet.

Uh… distinct items?

I’m having trouble describing this one, and a bit of trouble reasoning about it.

The general idea is that there are cases where you need to have a general pattern, but replicated a lot of times with specific contexts. The simple example would be to have 30 nodes - each of which has a self-referential pattern that only refers to itself. This is kinda like the Cassandra situation, with the subtle distinction that each Cassandra node talks to all other Cassandra nodes, while in this case each node would only talk to itself. Effectively, each node is its own context for a role (as ugly as that sounds) that follows the pattern.

There are two practical answers for this right now:

  1. Since it’s self-referential, it’s actually unlikely to need to be defined (most nodes can talk to themselves, and processes are probably listening on localhost anyways - which has overlapping IP space, and thar be dragons in trying to reason that one down right now).
  2. You can enumerate each as a separate context (sketched below) - this seems like a workaround, but it at least allows for it, just not efficiently.
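
A hedged sketch of that enumeration, where each node gets its own context for the same pattern - the “selfpattern” model name and the per-node context names are made up:

node01: [node01-ctx::selfpattern::member]
node02: [node02-ctx::selfpattern::member]
node03: [node03-ctx::selfpattern::member]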

So, that may be enough of a starting point.

coda

I think that’s enough for now. Definitely something to help ponder through all of this…

17 Feb 2017, 07:08

CAD, not CAP

Not being Partition Tolerant just means that you’re not Distributed, so CAP can be read as CAD - this might help with reasoning about it.

There’s this thing running around called the CAP theorem by Eric Brewer. It’s meant to show you that you have to make a tradeoff when designing a system - like “you can be good, fast, or cheap; pick at most two.” You pick a spot somewhere in the CAP triangle.

(figure: CAP Triangle)

To paraphrase:

  • “C” for Consistency: you get the most recent write or an error,
  • “A” for Availability: you always get an answer,
  • “P” for Partition tolerance: you can still talk to the system (outside-in) even if there are internal communication issues. E.g. a Partition happens when a node in the US loses its communication to a node in Europe due to a DDoS attack that takes out provider internet access.

The theory is that you can only do 2 of these*. The AP system sacrifices consistency, meaning that you can get different answers during/after a partition. The CP system sacrifices availability, meaning that during a partition, some part of the system is unable to serve data. The AC system sacrifices… having a partition?

This is where the language doesn’t feel like it fits. The only way to choose an AC system is to not permit a partition. The tolerance isn’t being able to handle it when it arises so much as it’s allowing it to happen at all. The only current way to handle that (short of advances in quantum entanglement) is to have a monolith and not a distributed system. But the CAP theorem is only in the context of a distributed system.

Maybe the AC is a degenerate case of “when everything is functioning fine”, or it’s meant to handle the degenerate “distributed system of one.” I don’t know what Brewer’s original thought on this is, but it seems a bit off to handle this by calling it Partition Tolerance. It really feels like it shouldn’t be a part of it, and that you’re left with deciding where you want to be on the AC line of the triangle - which really just turns this into a line:

(figure: AC Line)

Every distributed system has to figure out how it’s going to handle the inevitable Partition events that happen in it. It’s a fundamental property of a distributed system.

So, next time you hear “Partition tolerant”, process that as “distributed” and see if that makes it easier to handle.

* Writer’s Note: The recently released Google Spanner claims a bit of being able to do all three. I haven’t looked at it yet, so maybe I’m wrong in my thought here.

08 Jan 2017, 20:45

Context Permissions

I recently revisited AWS permissions, and found that, even though they haven’t fixed a seemingly significant design flaw, they’ve institutionalized the workaround. But even so, I’m not sure I’m all in on what has been institutionalized being what is commonly needed, even if it is what is traditionally done.

I’m talking about multi-tenancy inside of an account.

Typical permission models are built around:

Principal X can do Action Y with Item Z.

The way you distinguish any of X, Y, or Z is variable. Most of the time, X is “this user” or “user with this property (e.g. in this group, with this role, with this property)” - arguably, “this user” is the degenerate case of “user with the property of ‘this ID’”, but meh. Similarly, Y and Z can be “this specific one” or “one with this property.” In many permission models, there’s a special property of Item called Owner that is called out or elevated (i.e. automatically applied).

AWS allows you to create user accounts for managing the AWS resources inside of an account, and use these user accounts in an X/Y/Z model. The model was originally built around keeping people to specific functional silos. Some example actions just inside of EC2 for instance management:

ec2:DescribeInstances
ec2:RebootInstances
ec2:TerminateInstances
ec2:RunInstances

The first three operate just fine in this model as long as you can get the Item property (e.g. Owner) set correctly. The last one creates a seed issue for the first three: unless you handle it at creation time, you (as the Principal) can’t go back and apply the Item property to a specific Item without being able to apply it to all Items, and without the property the Item becomes orphaned.

AWS has built a lot around the tagging system, CloudWatch, and Lambda to allow for a workaround. It starts by specifying that access is granted on the condition of a tag being equal to $aws:username. In short, the workaround looks for the create events via CW and fires off a Lambda function that then applies the Owner tag. There are two concerns here: 1.) While this works, it feels fragile - I’m waiting for items to slip through the cracks and become orphaned. 2.) In the supplied workaround, access is tied to the AWS IAM account that did the creating, which is very limited. It would be much nicer to apply this to a Role (not doable from what I can tell, though maybe Lambda can reference the authentication logs and see what Role you switched into, but that seems unlikely), or to a Group (more easily looked up, but which of your Groups?).
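
For reference, a minimal sketch of that conditional, written as yaml here for readability (real IAM policies are JSON) - the Owner tag name is just the convention from above:

Version: "2012-10-17"
Statement:
- Effect: Allow
  Action:
  - ec2:RebootInstances
  - ec2:TerminateInstances
  Resource: "arn:aws:ec2:*:*:instance/*"
  Condition:
    StringEquals:
      # only allow actions on instances tagged with your own username
      ec2:ResourceTag/Owner: "${aws:username}"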

In multi-tenancy, what you’re really looking for is context. Sometimes, I’m working in “my workspace;” other times, I’m working in “project A” or “project B.” Instead of an Owner or Group or Role (well, possibly in addition to), I want the Item to have the appropriate Context so that only I, or anyone with project A or project B, can perform the appropriate actions.

In essence, this is what the AWS Account is - the context for any action. If you want someone to be able to make changes to any of the instances, you give them ec2:*Instance* (essentially). If you want them managing the network, you give them ec2:*VPC* (and a few others). The AWS documentation is very good at listing what the actions are. It’s interesting that this leads to a functionally siloed management approach, which seems to be something the cloud is supposed to counter (going by rhetoric and commentary, not necessarily reality), but that’s neither here nor there.

The Account is also a context for any given Item - you can’t (outside of some specific cases) share Items across accounts. An instance is only in one account. A network is only in one account.

The overlap of the Account as permission context and the Account as resource container starts to cause some friction. The problem is that a good amount of the time, you want to keep context separate for one Item type (e.g. instances), but keep a shared infrastructure together for another Item type (e.g. network). However, you have to go for one or the other. You can’t say “use multiple contexts for instances, but only one context for the network.”

Much like the Owner concept, you can try to apply a Context to AWS permissions - use a tag to try to match something. The problem is that the Context is not supplied at any time. You can have the tag applied on boot (see the prior comment about seed-issue fragility), but you can’t check for it - there’s no $aws:context. It’s not automatically determined/tracked in CW, so you can’t follow that mechanism. The concept is just not there.
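
If the concept did exist, the policy side might read something like the following - to be explicit, the ec2:ResourceTag matching is real, but $aws:context is purely made up; it’s exactly the missing piece:

Condition:
  StringEquals:
    # hypothetical - there is no $aws:context policy variable today
    ec2:ResourceTag/Context: "${aws:context}"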

While it’s applicable in various places, I see it coming up mostly around networking; given the nature of what AWS is supporting, the network is the most shared component. Maybe a special case is due here to allow networks to span multiple AWS accounts, but that seems very unlikely. Or maybe you need to be able to specify “Item A is context specific” and “Item B is not context specific.”

Fundamentally, I don’t believe anything has changed - maybe some trick has been discovered, but you’re still left without contextual separation. Because of this, AWS Accounts are not meant to be multi-tenant. Because it’s hard to go back and rebuild the authz mechanisms based on a different model, I don’t see AWS Account tenancy changing. You can try to apply some layer on top of that, but you are fighting a fundamental concept that is not aligned with the common use case, so I’m not sure that it will take you very far.

07 Jan 2017, 21:15

Blogging is scary

You may have noticed that there’s not too much content here recently. At best, I have one article that’s in draft state and nothing else. Maybe some ideas for articles, but that’s about it right now. According to the last item here, the last piece published was 7-8 months ago. So, 2 articles per year is not exactly a great number.

The truth is that I’m afraid.

Blogging - like other forms of writing and expression - is very personal. You’re putting something out there with the hopes that others will a.) see it, and b.) like it. So, not only do you have a chance to fail, but you have a chance to fail twice. People can not even take the time to see it, or they can see it and find all of the flaws in it and let you know. And in the wonderful world of comments, they can let you know in the most delightful ways.

Side note: I have no plans to allow comments. Despite my idealism, I’m very disheartened by the current state of Internet commentary. I’m not sure how to deal with the fact that it gives a megaphone to the lowest of the low who don’t even believe other people should exist. I’m not going to be a part of that here.

I’m not too worried about the first failure. I’d like to make some joke about “Schrodinger’s anonymous: it’s both read and unread until you actually write it so people can choose to read it or not” but can’t say if that’s really accurate. What is accurate is that if I don’t write it, then I’ll fail to be read by default.

It’s the latter that is more a problem for me. I don’t like to be wrong. I don’t like to write something and have it there forever that I was wrong. It’s like that hastily written 3AM tweet that someone caught a screen cap of, or that’s already on the Internet Archive (which is a very good thing to have in this world) and you can’t simply dismiss it as if it never existed.

I have ego. It is fragile.

This has an easy fix - just make sure that everything I ever do is perfect. See, easy.

No, it’s not a single thing that will fix this. Sometimes, it is taking the time and being meticulous with what I have expressed. Sometimes, it is going ahead thinking it’s right and not seeing what’s wrong. Sometimes, it is looking back and being willing to admit it was wrong. Sometimes, it is as simple as putting it out there and saying:

Here. I did this thing. It's got these good parts, and it's got these bad
parts. Does anyone else see anything that's good or bad with it?

But, as with other areas where I’m just really good at procrastinating, it’s about saying “let me start here, and then just keep starting.” I accept that I will fail at times, and that I will not want to. And that’s okay. I just can’t do that all of the time.

Courage is resistance to fear, mastery of fear--not absence of fear. Except
a creature be part coward, it is not a compliment to say he is brave; it is
merely a loose misapplication of the word.
     - Twain, "Pudd'nhead Wilson's Calendar"

25 May 2016, 19:12

DevOps: Centered vs Bounded Community

And now for some navel gazing…

What makes DevOps DevOps?

I’m late to the discussion. There are so many different takes on what DevOps is, and defining it has been compared to blind men trying to describe an elephant (the image is from earlier, but I can’t find an earlier reference to it).

Most of the descriptions seem to be along the lines of “DevOps is a role that does X” or “if you do X, you’re doing DevOps.” It’s been about defining the boundary of what DevOps is and what it is not.

Dan North had a great talk at Craft Conf. The overall talk is on embracing uncertainty, but there’s a section where he talks about two classifications of communities:

  1. Centered communities are ones where there are principles which are aimed for. One’s involvement in the community is measured by trying to move towards the principles or away from them.
  2. Bounded communities are ones where there are markers which indicate the edges of the community. One’s involvement in the community is measured by being inside the markers or outside of them.

This got me wondering. As I’ve said, most of the discussions I’ve seen talk about DevOps in a Bounded way. Does it make sense to talk about it in a Centered way? If we consider it from a Centered way, that means there are principles that we try to move towards. Those would include:

  • Automation: Methods as well as testing
  • Breaking down the silos: Understanding other roles and working together
  • Improvement: Looking for ways to learn and get better.

There are more principles to include, and depending on the discussion some principles or priorities for principles conflict with others, but I’m looking at the meta-discussion so am going to hand-wave on enumerating them here.

There’s a kicker: principles can be phrased as boundaries. You can focus on “breaking down silos” or you can have a situation which has no silos. The former shows the striving towards it; the latter is a state (“is”). So, it may be hard to pick out, or this may all be an issue of the language we choose to use.

Let’s not forget the appeal of Bounded. Practically, Bounded discussions provide more concrete items which people can try to learn and mimic (cargo cult maybe?). And Bounded discussions are more likely to create groups which will be incentivized to spread it (see Dan’s comments about the Agile communities and the Formal Processes which came out of them and could be sold). Centered discussions are more vague and can lead to an identity crisis.

Despite all of that, what I really like about the Centered way is that it is a bit more inclusive. The Bounded descriptions seem to create more of an “us vs them” situation, and one with lots of thems. That seems counter to the “Breaking down silos” aspect.

I think that that’s the way we should be moving - at least, for my version of DevOps. I’m hoping that it is moving in the same direction as many others’ versions of DevOps. Maybe that’ll be enough for now…

19 May 2016, 20:30

I should know better...

Always keep a backup. That’s the motto, right? And that was mistake number 1.

Mistake number 2 was an rm -rf on the wrong thing. Knives are sharp. Cut away, and stay in school.

On the plus side, the only major loss that wasn’t covered elsewhere was this blog. And I did have a “copy” of that in the S3 bucket. So… all I had to do was pull it back, reverse engineer it, use it as a comparison, and pull stuff back in.

hugo lessons learned

  • It worked! Static site generators - I use Hugo - are nice for a lot of reasons… this just cements that in my mind. So reproducible.
  • It’s been a while since I set up a site from scratch. Not that there’s a whole lot to it, but this time it just felt smoother. Not sure if that was my familiarity, or if Hugo is a bit more mature. Either way, pretty smooth.
  • When rebuilding this way - you gotta refrain from making all those fixes that you want to: no fixing typos - even small ones (e.g. whitespace cleanup); no fixing the input even if there’s a bug you worked around that is now fixed; no fixing broken links on remote sites that have bitrotted out; no fixing the tags/categories that have grown organically. Any fix makes the static build not match, and that breaks the rebuild down.
  • Unicode is hard - copying and pasting and seeing differences between editors over time all lead to unicode characters (accented ‘, “, etc) sneaking in. This makes it hard to compare, because some are interpreted and some aren’t. I got into a consistent pattern of “matching forward” until one post was completely the opposite, so trying to fix it was counter to what was supposed to happen. Sigh.
  • Line lengths are hard. When writing a blog article, I was trying to keep a line length limit to make it readable. However, HTML is a honey badger - it doesn’t care. So copying stuff back and forth after it’s been browser-rendered just isn’t fun.
  • Daylight Saving Time makes guessing timestamps hard. Hugo renders without the timestamp, and I didn’t feel like looking it up for every post, so I guessed. Should probably fix that (though, it doesn’t actually matter with how it’s rendered…).
  • Remember to clean up your public directory on occasion. If you change out templates or other static files… they don’t necessarily go away. Probably should remove public and rebuild regularly (like… every push).

So - it can be done. It wasn’t too bad. Actually, it was pretty good. But given that I only have 25 posts, it wasn’t onerous. If there were more, I really would have done it a bit more automatically (probably should have with 25, but… eh…).

personal reminders

Kinda got 4 big things out of this one:

  1. Yea, (well past) time to get a backup on the laptop. Got one on the desktop machine, but most of the items on the laptop are tied to some service and backed up there… but not everything.
  2. I really should proofread my writing. I’m not good about this, and looking back over stuff, I really should be. I do sometimes do a quick once-over on the writing, but most of the time, it comes out more raw than that. So… something to work on there.
  3. 2015 was not a pretty year. I didn’t write anything last year, and on top of that, some of the ones I thought I wrote last year are actually 2 years old. I know work was a bit demanding last year, but I’m still realizing how much it took out of me.
  4. Related to #3: I have a bunch of ideas, a smaller number get written down, and an even smaller number get something real put to them. I’d like to get a higher conversion ratio there.

08 May 2016, 22:22

X-Forwarded-For: You keep using that word...

I recently had a discussion around the X-Forwarded-For header and common usage. This wasn’t the first time I’ve had the discussion, and probably won’t be the last. I’m going to jot down some thoughts for future me and future others.

For the record, this is the perspective from running online services - so the focus is on the incoming requests.

tl;dr: X-Forwarded-For is not as standard as people believe, and even where it is standard, it’s not the standard that people think it is. And don’t use it for server-to-server calls - just for proxies.

Issues

I have to ask three questions every time I see it in use:

  1. Can I believe that the client (or middleman) was truthful about the value?
  2. Can I believe that nothing in the middle messed up the handling of the value?
  3. Is the service making the call just a “proxy”?

For the first one, let’s go back to the first rule of internet fight club - don’t trust anything from the client. This can easily be spoofed, so I have to remember to strip it at my front door.

As for the second question, let’s just say that the chances are not insignificant that something isn’t handling it correctly. The current RFCs for HTTP 1.1 (2616 and the new proposal 7230) both allow for multiple headers as long as the header value is a list:

A sender MUST NOT generate multiple header fields with the same field
name in a message unless either the entire field value for that
header field is defined as a comma-separated list...

https://tools.ietf.org/html/rfc7230

Most make the assumption that it’s on one line, and that it’s comma separated (and I even had a case where some assumed it was a single value). The truth of the matter is that there are plenty of bugs in well-known projects which don’t handle this correctly. There are more bugs in internal code.
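
To make the RFC 7230 point concrete, these two forms carry the same information, and a correct implementation has to treat them identically (the addresses are from the RFC 5737 documentation ranges):

X-Forwarded-For: 203.0.113.7, 198.51.100.23

X-Forwarded-For: 203.0.113.7
X-Forwarded-For: 198.51.100.23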

This second question might have been handled by rfc7239, but it seems to contradict rfc7230. On the one hand, its header format is no longer just a list; on the other hand, rfc7239 explicitly permits (CAN) multiple headers. So, my vote is still out here.

So, regarding the first two questions, at all of your inspection points (including the implicit ones which are easily overlooked), you have to make sure the XFF is being handled correctly. If not, it’s worse than useless; it’s dangerous.

The philosophical question…

The last question is a bit harder to noodle through. What does it mean to be “forwarded”? The context here is a lot of proxying of requests where it is expected that the proxy isn’t making a meaningful change to the request itself (adding some tracking headers, converting from HTTPS to HTTP, caching, etc).

This is different than the case where one service is calling another service. The source service isn’t really proxying the request as it’s making a new request on behalf of the client. Even in the newer RFCs, I haven’t found a clear definition of “proxy” so I don’t think there’s a formal answer.

This may be a subtle distinction, but the meaning has consequences for how you manage it. When doing controls, there usually can only be one source: one value that gets used to compare.

As a simple example, say I only want requests from a specific geography to come in, and I’m servicing both clients and other services. I have to decide if I want that geo restriction to apply to the original client or to the geo of the last connection, which could be a service. If I choose the last connection, then I’m going to be shutting down a lot of clients in that geography because they are dependent on a service in another geography. If I choose clients, in addition to making sure the client info gets to me, I have to make sure that the services are good at handling those split good/bad responses.

In the case of rate controls, I have to make the same decision, and that’s got its own issues in that I probably want both - one rate control for end clients, and another, looser rate control for partner services. Are you supposed to parse the XFF chain and figure out which came from which? Can you even apply separate controls that way?

This leads me to the mindset that XFF should only be used to show a source connection via proxy, and something else should be used for requests made as part of a services chain: “X-Requested-For” or something similar.
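
If that split existed, the two cases would arrive looking something like this - to be clear, X-Requested-For is just my suggested name here, not any standard header:

From a proxy in front of the client:

X-Forwarded-For: 203.0.113.7

From a partner service calling on that client's behalf:

X-Requested-For: 203.0.113.7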

26 Feb 2016, 13:34

But I don't want multi-tenant networking

I’m in a bit of a conundrum at work.

It’s coming to the point where I need to put some formality around how everything talks to everything else. I’m merging three different network administrative domains (at least three - some partners really are another curveball).

The question comes down to how - how do I bridge our internal network IP spaces?

I believe - and “believe” is a funny word to be looked at in a minute - that the principle of least surprise says that the network engineers want to use in this environment is one that is as flat as possible from the ip route perspective. It is the one where I can reach any other part of it (ignoring security policy) without having to think about it. When people ask “Can I get to that from here?” they usually are asking “Are there policy permits that let me get there from here?” and not usually “Would a packet leaving me be routed the right way to get there (and would routes bring it back)?”

Now, there is a new generation of engineers who are growing up “cloud native” and recognize that managing the IP space…

This might just be me, so I should probably ask around… that’s a blog article on its own. Or holding this one…

Of course, there’s always IPv6. In theory, that solves everything. But that’s not a space that I’m going to see soon. Hopefully, I can be a midwife to usher that in, but that means handling both.