15 Apr 2017, 14:50

Cloud Providers as Lean Suppliers

“Move toward a single supplier for any one item, on a long-term relationship of loyalty and trust.” - Deming

One of the parts of Lean Manufacturing is that any optimization that only focuses on your is the same as any optimization that’s focused on one part the floor - “an illusion.” It’s a local optimization and doesn’t look at the system as a whole. You may not actually improve anything. There’s a pragmatic approach to it, so sometimes you focus on your sphere of control instead of just your sphere of influence, or you wait till “the right time,” but the point remains. You have to look at your supplier (or consumer) and see where optimizations there can make your delivery more optimal. This is so prevalent so that’s even books on it regarding Toyota’s, the Lean poster child, supply chain.

Now, being able to have that kind of influence on your supplier is a tricky prospect. Some simple thoughts is that it requires a relationship where you as the customer have a lot (maybe undue) influence on the supplier that the supplier is coerced into doing what you want, or that the supplier is so customer focused that they’ll do anything that you want. The simple thought with regards to Toyota falls into the former category. The not so simple thought is that this interaction hinges on a deep (not just lengthy, but to the degree of interaction) and mutually beneficial relationship. The key to a relationship is that both sides work on it and work a lot of give-and-take into it.

This is quite the opposite from a push to multiple suppliers that are all interchangeable. Establishing the relationship is secondary to being able to switch away from a supplier at any time for any reason.

The truth is that nothing is black and white, and different situations are at a different level of gray. One company may work very hard to be supplier-agnostic, while another may work very hard to establish deep relationships.

The same plays out in the IT arena - in multiple ways and multiple levels. You have long standing deep relationships with existing vendors, and you have senses of “this vendor is interchangeable.” Interchangeable is always iffy, as that depends on the commoditization of the both the technology and the process that uses it, but it is a goal that is longed for. This works down even to code level where one interface may have multiple implementations. Whether that that last logic is the source for that logic showing up in vendor management is anyone’s guess.

Today it’s an open question of how deep does an IT group get with a Cloud Provider. It’s not an easy statement to say that that should never be very deep. Even if you maintain a hard line of only relying on features that have parity across multiple vendors, you still have to make sure that the interface surface for managing those features are on par. There’s many types of lock in that we tend to underestimate the size of the interface surface (even to the point of ignoring many parts of it).

“But how do I maintain a relationship where I’m tied into a lot of them, but that I’m only a small part of their revenues?” Well, there’s no simple answer. The Cloud Vendors are some of the largest companies in the tech industry, and hard to hold any cards over. But they got where they are by having a huge focus on the customer (Amazon even claims to be the most customer-centric company on earth), so you can assume a bit of work to be done connecting with you.

The best advice I can see is “be pragmatic and mindful, and make sure you’re communicating.” There’s going to be issues maintaining that relationship, and there’s going to be risks with having that relationship. Don’t go into it expecting a silver bullet or that it will never be rocky. There will be bumps on the way, and there will be back-and-forths. The key is to know that you have to work at it, and make sure each side knows what that work is. And realize that sometimes it’s just necessary to lift a lot yourself to be able to move over to a new relationship.

16 Mar 2017, 20:31

CAD, not CAP

EDIT NOTE: this was incomplete but putting it out there anyways — almost 2 years later…

I’m exploring a few options on

Model names, instance names

First - have to apply actual instances of models. Using prefix for this:


The first item is the portability of models across the development cycle. I.e. how to use the same model applied to different environments. This largely comes down to the instances of the model and based on the previous article, would be prefixing the model with the environment name. As an example, if you have “Dev,” “Staging, and “Prod” environments from the simple 3 tier app model, you could end up with role assignments of:

devweb01: [dev::3tier::web]
stgweb01: [stg::3tier::web]
prdweb01: [prd::3tier::web]

IF dev and stg shared a database:

devdb: [dev::3tier::db,stg::3tier::db]
prddb: [prd::3tier::db]

side note

I still don’t have a good case for “this is running on port 8443 in nonprod, but running on port 443 in production.” This falls into two options that I see:

  1. Have an override that can be loaded in (similar to the model mapping).
  2. Removing all direct references to port or other information from the model, and having that be in specific mapping per environment.

The problem with the first is that it isn’t clean - you have to look in a couple of places to see what is the case. The problem with the second one is that you get some redundancy and you have to look in a second place (it’s always in the second place, not that it’s possibly in one of a couple of places). So, no good idea there.

CMS path

18 Feb 2017, 15:03

ANCL: Use Cases

I had a chance recently to revisit ANCL in two ways. I recently had to compare some firewall rules between two different firewalls that were setup to mirror each other. The original firewall was not set up using any modeling of firewall rules so very much fell under the issues that I originally commented on. Lessons learned:

  1. It’s much easier to collect/reason about the communication pattern when you look at it from the “what do I need to talk to” perspective instead of the “what talks to me” perspective. People are blocked when their downstreams aren’t working and so have a bit more motivation to make sure those are well described.
  2. Even JSON is a bit more verbose than that I wanted to deal with when working on the rules in mass.
  3. To make it happen, I simplified and didn’t attempt to build out any kind of hierarchy or dependency. Even though some would stem from the same model, I manually created the instantiation of those models. I don’t have a good approach for this yet, but working through the concrete example gave me a better understanding.
  4. It’s easy to use IP address in the context of firewalls, and you can overlap anything that has that address or containing CIDR.
  5. Naming is hard (1): There seems to be a bit of redundancy with model roles. If I have service, what do I specify for clients and what do I specify for the port which the clients connect to? In both cases, I want to use “client”
  6. Naming is hard (2): It’s still not clear to me what to use to describe the generic descriptions (e.g. models), the components in those descriptions (e.g. roles), and the instances of items in those roles (e.g. nodes?). I keep using the term roles in the place of the nodes - I think.

Separately from the firewall, I’ve been looking at using this to help figure out overall communication matters. I’m trying to bridge together different applications running by different groups and using different interconnect mechanisms. I need to get quality and slot information for what’s talking to what. That’s got me thinking. A few more items to postulate:

  1. Mental exercise: How does routing information play into all of this? Does different routing affect how the models are structured?
  2. Mental exercise: Can I use the models to influence aggregation and reporting on netflow data. Each netflow entry could be associated with a specific model which gives a lot more context than protocol, port, and subnet which ends up being the bulk of what I usually see?
  3. Mental exercise: What’s it looks like to add additional information to each model? Not just “443/tcp” but also “100Mbps”, “TLS”, and “client certificate required”?
  4. Mental exercise: In the first model, I associated roles to specific IPs. What’s it look like when instead of IPs, I use AWS instance IDs, or AWS security group IDs, or container IDs, processes, etc?

So, there’s a lot more interesting stuff beyond just the firewall, and it’ll be interesting to see what comes up. But I still worry about the complexity, so I want to figure out ways to reduce that complexity.

The first one is to not have specific models (“this application’s Oracle DB”) for everything and to be able to use more generic models (“Oracle DB communication”). This means having an ability to reference a model. I’m still not sure how to do that. So, I’m trying to take a step back and come up with some use cases to help noodle through this. So with that in mind, the remainder of this is about examining that. I’m not committing to anything so you’ll see possibly a few implementations below.

Use Cases

Simple 3 Tier Application

This is your classic three tier application.


A sample general model could look like:

  - [web,webapi]
  ingress: []
  - [app,appapi]
    webapi: [443,443,"tcp"]
  - [db,sqlnet]
    appapi: [8009,8009,"tcp"]
  egress: []
    sqlnet: [1521,1521,"tcp"]

Shared DB

This is the case of the same model being applied in two different context with one overlapping resource. The example is a shared DB resource (here shared between dev and prod, but probably shared across multiple DB)

prod/dev share db

A fully expanded model could look like:

  - [db,sqlnet]
  ingress: {}
  - [db,sqlnet]
  ingress: {}
  egress: []
    sqlnet: [1521,1521,"tcp"]

However, in reality, there’s a base model which looks like just:

  - [db,sqlnet]
  ingress: {}
  egress: []
    sqlnet: [1521,1521,"tcp"]

The question is really about how to relate multiples together. Looking at roles:

prod-app: ["prod::app"]
dev-app: ["dev::app"]
db: ["prod::db","dev::db"]

This works in this simple example, but I’m not sure it covers everything (see below).

Same model applied to node as two different roles

This is one that masks quite a bit so it’s not clear what the perfect setup is. The simple case is that there’s a DB that serves sqlnet, but in turn also connects to other DBs using sqlnet (e.g. replication).

main-db -> ro1-db -> ro2-db

This could looks like:

  - [db-server,sqlnet]
  ingress: {}
  egress: []
    sqlnet: [1521,1521,"tcp"]

main-db: [main2ro1::db-server]
ro1-db: [main2ro1::db-client,ro12ro2::db-server]
ro2-db: [ro12ro2::db-client]

The “db-server” and “db-client” part feels a bit weird. I kinda want to just have “server” and “client” but then feel like I need another name hierarchy - e.g. “db::client” and “db::server” - so the roles would look like:

main-db: [main2ro1::db::server]
ro1-db: [main2ro1::db::client,ro12ro2::db::server]
ro2-db: [ro12ro2::db::client]

This looks ok, but there’s two concern for me:

  1. How many things are “client” or “server” so would there be a way to simplify that?
  2. Having to have a context for all of the directed pairings seems a bit overdone. Is there a way to simplify that?

The latter concerns me more. Maybe not using the pairwise, and looking at the context to be a bit more on the node (in this case) itself:

main-db: [main::db::server]
ro1-db: [main::db::client, ro1::db::server]
ro2-db: [ro1::db::client]

Node as Multiple models

This is the case of having a node participate in multiple models. The example is that the node is part of its main role (app or db), but it’s also being monitored and logged into (so, “adminee” controlled by “adminbox”).

(insert app/db models above)
    ssh: [22,22,"tcp"]
    snmp: [161,161,"udp"]
  egress: []
  ingress: {}
    - [adminee,ssh]
    - [adminee,snmp]

With an example of multiple roles put together:

prod-app: ["prod::app","adminee"]

Overlapping attributes

This is more of “if there’s overlapping attributes it needs to get the roles of any roles that match that attribute.” Simple example of overlapping IP addresses/CIDRs:

"": [adminbox]
"": [adminee]

In this case, has both [adminbox,adminee]

Self Referential

Some models are a bit self referential. Nodes of the same role will talk to each other (cluster members). Nodes of the corresponding role (cluster members in differnt subsections of the cluster) will talk to each other in another way. The post child for this is Cassandra:

Cassandra Fun

So, a model might look like:

  - [server,binary]
  ingress: {}
  - [server,plain-gossip]
  - [remote-server,encrypted-gossip]
    binary: [9042,9042,"tcp"]
    plain-gossip: [7000,7000,"tcp"]
    encrypted-gossip: [7001,7001,"tcp"]
  - [local-server,encrypted-gossip]
    encrypted-gossip: [7001,7001,"tcp"]

And the roles might look like:

app-dc1: [dc1::cassandra::client]
app-db2: [dc2::cassandra::client]
cass-dc1: [dc1::cassandra::local-server,dc2::cassandra::remote-server]
cass-dc2: [dc2::cassandra::local-server,dc1::cassandra::remote-server]

I’m actually surprised by this model. It seems to be one of the cleanest but it’s also pretty complex. Feels like a trap but I’m not seeing it yet.

Uh… distinct items?

I’m having trouble describing this one, and a bit about reasoning about it.

The general idea is that there are cases where you need to have a general pattern, but replicated a lot of times with specific contexts. The simply example would be to have 30 nodes - each of which have a self referential pattern that only refers to them. This is kinda like the Cassandra situation with the subtle distinction each Cassandra node talks to all other Cassandra nodes and in this case each node would only talk to itself. Effectively each node is its own context for a role (for as ugly as that sounds) that follows the pattern.

There’s two practical answers for this right now:

  1. Since it’s self-referential, it actually is unlikely to be needed to be defined (most people can talk to themselves and processes are probably listening on localhost anyways - which has overlapping IP space and thar be dragons with trying to reason down that one right now).
  2. You can enumerate each as a separate context - this seems like a workaround, but it at least allows for it, just not efficiently.

So, that may be enough of a starting point.


I think that’s enough for now. Definitely something to help ponder through all of this…

17 Feb 2017, 07:08

CAD, not CAP

Not being Partition Tolerant just means that you’re not Distributed, so CAP can be read as CAD - this might help with reasoning about it.

There’s this thing running around called the CAP theorem by Eric Brewer. It’s meant to show you that you have to make a tradeoff when designing a system - like “you can be good, fast, or cheap; pick at most two.” You pick a spot somewhere in the CAP triangle.

CAP Triangle

To paraphrase:

  • “C” for Consistency: you get the most recent write or an error,
  • “A” for Availability: you always get an answer,
  • “P” for Partition tolerance: you can still talk to the system (outside-in) even if there’s internal communication issues. E.g. a Partition happens when a node in the US loses its communication to a node in Europe due to a DDoS attack that takes out provider internet access.

The theory is that you can only do 2 of these*. The AP system sacrifices consistency meaning that you can get different answers during/after a partition. The CP system sacrifices availability meaning that during a partition, some part of the system in unable to serve data. The AC system sacrifices… having a partition?

This is where the language does feel like it fits. The only way to choose a AC system is to not permit a partition. The tolerance isn’t being able to handle it when it arises so much as it’s allowing it to happen at all. The only current way to handle that (short of advances in quantum entanglement) is to have a monolith and not a distributed system. But the CAP theorem is only in the context of a distributed system.

Maybe the AC is a degenerate case of “when everything is functioning fine”, or it’s meant to handle the degenerate “distributed system of one.” I don’t know what Brewer’s original thought on this is, but it seems a bit off to handle this by calling it Partition Tolerance. It really feels like it shouldn’t be a part of it, and that you’re left with deciding where you want to be on the AC line of the triangle - which really just turns this into a line:

AC Line

Every distributed system has to figure out how it’s going to handle the inevitable Partition events that happen in it. It’s a fundamental property of distributed system.

So, next time you hear “Partition tolerant”, process that as “distributed” and see if that makes it easier to handle.

* Writer’s Note: The recently released Google Spanner claims a bit of being able to do all three. I haven’t looked at it yet, so maybe I’m wrong in my thought here.

08 Jan 2017, 20:45

Context Permissions

I recently revisited AWS permissions, and found that, even though they haven’t fixed a seemingly significant design flaw, they’ve institutionalized the work around. But even so, I’m sure I’m all in on what has been institutionalized as being what is commonly needed even if it is what is traditionally done.

I’m talking about multi-tenancy inside of an account.

Typical permission models are built around:

Principal X can do Action Y with Item Z.

The way you distinguish any of X, Y, or Z is variable. Most of the time, X is “this user” or “user with this property (e.g. in this group, with this role, with this property)” - arguably, “this user” is the degenerate cause of having “user with the property of ‘this ID’” but meh. Similarly, Y and Z can be “this specific one” or “one with this property.” In many permission models, there’s a special property of Item called Owner that called out or elevated (i.e. automatically applied).

AWS allows you to create user accounts for managing the AWS resources inside of an account, and use these user accounts in an X/Y/Z model. The model was originally built around keeping people to specific functional silos. Some example actions just inside of EC2 for instance management:


The first three operate just fine in this model as long as you can get the Item property (e.g. Owner) set correctly. The last one creates a seed issue for the first three: unless you know/work it at creation time, you (as the Principal) can’t go back and apply the Item property to a specific Item without being able to apply it to all Items, and without that the Item becomes orphaned.

AWS has built a lot around the tagging system, CloudWatch, and Lambda to allow for a work around. It starts by specifying that access is granted on the conditional of being equal to $aws:username. In short, it looks for the create events via CW, fires off a Lambda function that then applies the Owner tag. There are two concerns here: 1.) While this works, it feels fragile - I’m waiting for items to slip through the cracks and become orphaned. 2.) In the supplied work around, it is tied to the AWS IAM account that did it which is very limited. It would be much nicer to apply this to a Role (not doable from what I can tell, though maybe Lambda can reference the authentication logs and see what Role you switched into, but that seems unlikely), or if it applied this to a Group (more easily looked up, but which of your Groups).

In multi-tenancy, what you’re really looking for is context. Sometimes, I’m working in “my workspace;” other times, I’m working in “project A” or “project B.” Instead of an Owner or Group or Role (well, possibly in addition to), I want Item to have the appropriate Context so that only I, or the anyone with project A or project B can perform the appropriate actions.

In essence, this is what the AWS Account is - the context for any action. If you want someone to be able to make changes to any of the instances, you given them ec2:*Instance* (essentially). If you want them managing the network, you give them ec2:*VPC* (and a few others). The AWS documentation is very good at listing what actions are. Interesting that this leads to a functional siloed management approach which seems to be something that the cloud is supposed to be counter to (from rhetoric and commentary, not necessarily from reality) but that’s neither here nor there.

The Account is also a context for any given Item - you can’t (outside of some specific cases) share Items across accounts. An instance is only in one account. A network is only in one account.

The overlap of the Account as permission context and the Account as resource container starts to cause some friction. The problem is that a good amount of the time, you want to keep context separate for one Item type (e.g. instances), but keep a shared infrastructure together for another Item type (e.g. network). However, you have to go for one or the other. You can’t say “user multiple contexts for instances, but only one context for the network.”

Much like the Owner concept, you can try to apply a Context to AWS permissions - use a tag to try to match something. The problem is that the Context is not supplied at any time. You can have the tag be applied on boot (see prior comment about seed issue fragility), but you can’t check for that - there’s no $aws:context. It’s not automatically determined/tracked in CW, so can’t follow that mechanism. The concept is just not there.

While it’s applicable in various places, I see it coming up mostly on networking, but given the nature of what AWS is supporting, the network is the most shared component. Maybe a special case is due here to allow networks to span multiple AWS accounts, but that seems very unlikely. Or maybe you need to be able to specify “Item A is context specific” and “Item B is not context specific.”

Fundamentally, I don’t believe anything has changed - maybe some trick has been discovered, but you’re still chasing no contextual separation. Because of this, AWS Accounts are not meant to be multi-tenant. Because it’s hard to go back and rebuild the authz mechanisms based on a different model, I don’t see AWS Account tenancy changing. You can try to apply some layer on top of that, but you are fighting a fundamental concept that is not aligned with the common use case, so I’m not sure that that will take you very far.

08 May 2016, 22:22

X-Forwarded-For: You keep using that word...

I recently had a discussion around the X-Forwarded-For header and common usage. This wasn’t the first time I’ve had the discussion, and probably won’t be the last. I’m going to jot down some thoughts for future me and future others.

For the record, this is the perspective from running online services - so the focus is on the incoming requests.

tl;dr: X-Forwarded-For is not as standard as people believe, and even where it is standard, it’s not the standard that people think it is. And don’t use it for server-to-server calls - just for proxies.


I have to ask three questions every time I see it in use:

  1. Can I believe that the client (or middleman) was truthful about the value?
  2. Can I believe that nothing in the middle messed up the handling of the value?
  3. Is the service making the call in just a “proxy”?

For the first one, let’s go back to the first rule of internet fight club - don’t trust anything from the client. This can easily be spoofed, so I have to remember to strip it at my front door.

As for the second question, let’s just say that the chances are not insignificant that something isn’t handling it correctly. The current RFCs for HTTP 1.1 (2616 and the new proposal 7230) both allow for multiple headers as long as the header value is a list:

A sender MUST NOT generate multiple header fields with the same field
name in a message unless either the entire field value for that
header field is defined as a comma-separated list...


Most make the assumption that it’s on one line, and it’s comma separated (and I even had a case where some assumed it was a single value). The truth of the matter is that there are plenty of bugs in well know projects which don’t handle this correctly. There are more bugs in internal code.

This second question might have been handled by rfc7239, but it seems to contradict rfc7230. On the one hand, it’s header format is no longer just a list; on the other hand, rfc7239 explicitly permits (CAN) multiple headers. So, my vote is still out here.

So, regarding the first two questions, at all of your inspection points (including the implicit ones which are easily overlooked), you have to make sure the XFF is being handled correctly. If not, it’s worse than useless; it’s dangerous.

The philosophical question…

The last question is a bit harder to noodle through. What does it mean to be “forwarded”? The context here is a lot of proxying of requests where it is expected that the proxy isn’t making a meaningful change to the request itself (adding some tracking headers, converting from HTTPS to HTTP, caching, etc).

This is different than the case where one service is calling another service. The source service isn’t really proxying the request as it’s making a new request on behalf of the client. Even in the newer RFCs, I haven’t found a clear definition of “proxy” so I don’t think there’s a formal answer.

This may be a subtle distinction, but the meaning has consequences for how you manage it. When doing controls, there usually can only be one source: one value that gets used to compare.

As a simple example, I only want requests from a specific geography to come in, and I’m servicing both clients and other services. I have to decide if I want that geo restriction to apply to the original client or the geo of the last connection, which could be a service. If I choose last connection, then I’m going to be shutting down a lot of clients in that geography because they are dependent on a service in another geography. If I choose clients, in addition to making sure the client info gets to me, I have to make sure that the services are good handling that split good/bad responses.

In the case of rate controls, I have to make the same decision, and that’s got its own issues that I probably want both - one rate control for end clients, and another loose rate control for partner services. Are you supposed to parse the XFF chain and figure out which came from which? Can you even apply some

This leads me to the mindset that XFF should only be used to show a source connection via proxy, and something else should be used for requests made as part of a services chain: “X-Requested-For” or something similar.

26 Feb 2016, 13:34

But I don't want multi-tenant networking

I’m in a bit of a conundrum at work.

It’s coming to the point where I need to put some formality around how everything talks to everything else. I’m merging three different network administrative domains (at least, there’s some partners that really are another curveball).

The question comes how - how do I bridge our internal network IP spaces?

I believe - and that’s a funny item to be looked at in a minute - that the principle of least surprise says that the network which engineers are wanting to use in the environment is one that is as flat as possible from the ip route perspective. It is the one where I can reach any other part of it (ignoring security policy) without having to think about it. When people ask “Can I get to that from here?” they usually are asking “Are there policy permits that let me get to there from here?” and not usually “Would a packet leaving me be routed in the right way to get there (and getting routes back)?”

Now, there is a new generation of engineers who are growing up “cloud native” and recognize that managing the IP space

This might just be me, so I should probably ask around… that’s a blog article on it’s own. Or holding this one…

Of course, there’s always IPv6. In theory, that solves everything. But that’s not a space that I’m going to see soon. Hopefully, I can be a midwife to usher that in, but that means handling both.

13 Feb 2014, 01:20

What does The Cloud(tm) mean?

The Cloud ™ - it’s a term that is far too encompassing of too many concepts.

At first, I thought the problem with describing it was that it was like the image of 10 blind people trying to say what an elephant was by each describing the one part they could feel. The more I think about it, that doesn.t even do it. The focus of that description is all about the “physical” description, but we’ve ascribed so much more into what we think of as The Cloud ™. Not only do we talk about what it is, but also what it can do, and what it can allow others to do. It’d be the same as trying to describe how an elephant herd interacts, or how the use of domesticated elephants affected agriculture or helped win a war.

In short, it’s impact is just as and probably more important than just what it is. So, let’s look at both of those in turn.

Physically, the cloud is a combination of the multiple *aaSes that exist, but largely focused on Software, Infrastructure, and Platform. Disclosure: In my realm, I end up interacting with the latter two, so this is largely concerned with those. To be clear, I say Infrastructure-aaS and mean any product which provides an abstraction of compute, storage and networking, which allows a user to obtain resources in a low latency (ideally sub-minutes given with self-service and API interfaces) SLA. PaaS is similar to the above but focuses on the application container (e.g. servlet engine, dynamic web server backend, database) instead of infrastructure components. The Cloud ™ can be public or private, it can be outsourced or internal, and it can even be service organizations in addition to true services.

We add confusion because all of these are “physical” descriptions, and so we tend to first compare on that level. Many look at The Cloud ™ as a single solution (most of the time, it’s AWS, but it can also happen on the other side with internal solutions). But really, we want to agree on what aspects of those solutions are important and the trade off that those require.

So what are those aspects? What can The Cloud ™ enable? Well, in not particular order, and definitely not complete:

  • It can be a cash flow offset. It allows you to focus on leveled burn (operational expenses) rather than big bang spends with depreciation (capital expenditures). How much this matters depends how your corporate finances are structured.

  • It can provide dynamic resource commitments. You can purchase resources for short term usages. The dynamic capability leads to a need for rapidly providing and taking those away. How much this matters depends on your duty cycle, your bursts, and what margins are like with the provider.

  • It can provide rapid global ramp up of resources. From the last point, where you get those dynamic resources, you can choose where they go. How much this matters depends on your ability to configure those resources rapidly and the global properties of your application, as well as the provider capabilities (e.g. points of presence).

  • It can be automation point. Not talk about The Cloud ™ can happen without some aspect of automation. Every cloud is built upon it. Every interaction asks the question “how can we automate it?” How much this matters - well, it just matters. Your ability to execute on this drives how helpful it is.

  • It can change the semantics of application deployments. You move from talking about a build of an application or code package, and towards building (at least for now) machine images or container images (with application and dependent code inside). How much this matters depends on how you do your application configuration.

  • It can change the semantics of host and system management. You move from talking about individual hosts to talking about abstract roles or clusters. See Pets and Cattle.

  • It can provide you a way to level your production. If you’re not familiar with Heijunka, it’s a way to smooth out the flow of invetory through the delivery pipeline. Virtual environments enable you to provide the just the right resources just in time, by taking larger undifferentiated resources and honing them into what you need. Previously, you had to be very targeted and keep a lot of pre-differentiated products that can be used when need be. This leveling helps speed everything up without keeping around too much inventory. How much this matters depends on how many different resource types you really need, and how much overhead you’re willing to take.

  • It can let users take care of themselves. It can provide self-service in very structured ways. You can replace people and teams and service catalogs with APIs. Replace is probably the wrong word as someone or something needs to handle the underlying infrastructure of the service, and the service itself becomes a very codified service catalog. How much this matters depends on the level of responsibility being expected and accepted by the service users.

  • It can transfer work and risk to a third party. You can outsource what you deem to be noncritical and/or commoditized aspects of your business to others. The funny thing about risk is that it is rarely actually transferred. How much this matters depends on how tolerant of risk you are, how much you can negotiate, and how well you can handle this internally.

Ultimately, it’s a matter of gaining some level of real or perceived efficiency. That efficiency can come in the form of economic (as in using for bursting, or cash flow changes), or in the form of faster changes, or in the form of shifting responsibilities, or probably others.

A lot of the above can be achieved without using The Cloud ™, and many of the aspects run counter to each other (e.g. virtualization overhead versus flexibility). All in all that makes it impossible to say that The Cloud ™ is goal. The goal is ultimately to make money, but the question is which aspect(s) of The Cloud ™ do the best to get you that?

11 Jan 2014, 21:51

More than just Pets and Cattle

wIt’s been said many times many ways that cloud) servers should be treated like cattle, and not like pets. Looks like the first reference is Bias, but there are quite a few others: here here here here just the top ones on a google search. The main idea being that we had this tendency when the servers were fewer yet more longitudinal to treat them delicately: putting care and feeding into each of them; now that we (can) have large amounts of short lived instances, we can’t be bothered with the same care.

That’s a completely valid way of thinking (it’s a great place to be), so I’m curious as to where its limits are. In some ways, looking at just servers that way is looking at a point in time and capabilities and thought.

We’ve all had pet files. Remember that hand crafted config file that you spent days of your life tweaking to get it just right? Maybe it was specific to that host. At some point, you groomed it enough that it became a golden file for your entire environment and you could copy it and push it out to all of the other servers. Then you pushed it out using some higher level config management system. Then you moved up some semantic level and the file itself got abstracted into specific resources, and those were composited and pushed out. So, files started as pets, and by realizing that the file was only a model of something that we actually cared about, they moved to cattle.

Really, pets are pets because you’ve become attached to them - you can’t clone them, and it hurts to lose them. Cattle is cattle because it’s easy to get another and it’s not a big deal if you lose it. There’s a lot of different specific means to achieve these, but it’s these two fundamental classes of properties that enable this thinking:

1.) It’s easy to copy, and 2.) It’s easy to handle losing it (enter whatever you want to say about antifrigilness here).

But thinking about files and servers is so the 2000-noughts. What are our pets now?

Moving up from the server, is the cluster. Are clusters now the new pets? or can we treat them as cattle as well? Given sufficiently large IaaS services and strong configuration management systems and lots of variable substitution (well, probably more like locally realized global patterns), it’s actually fairly easy to fulfill property #1 above - copying. As for #2, if you have sufficient global load balancing of any form (DNS, anycast, etc), you can easily route traffic to working clusters, or more precisely, away from failing (lost) clusters.

So, pulling further out, our clusters collapse into a service. Is that our new pet? With even more config and *aaS and some client service discovery (aka any sufficiently advanced delivery model), you can certainly copy it. Though, if you lose your source code, it would definitely take a bit to reproduce the service (get all those coders together again, etc). What about losing it? Well, if you are a single feature service inside of a larger service, you might be able to be disabled, so you can lose it. But what about that larger service? I think for most businesses, you can’t just lose it.

So, that’s your pet.


(One could examine businesses and business models and plans and use the same comparisons, but I think this first point - what makes something pet versus cattle across various object domains is copying and dealing with lose - is done well enough, so my second point…).

There’s another way to slice (heh) this metaphor: milk. Not all cattle is used for steak. Some cattle is used to produce a product, bulked up again, then produce more of the same product. That cycle time might be a little too short, so the metaphor might make a little more sense by using different livestock - sheep. Some sheep are raise for mutton, some sheep are raise for wool (and yes, you can do both, but still). For the wool sheep, after the wool is reaped each year, you have to let it grow out again before you can reap it again, all the while caring for the sheep. The sheep itself stays around, but you continue to reuse it.

That being said, you can use other sheep for the same purpose because lots of wool is the same; and sheep have their own way to easily copy each other well enough.

But you still don’t really want to lose a sheep. You still gotta deal with it going away and getting the replacement there. The same really applies to larger services (or businesses) - maybe you can copy it, but you really don’t want to deal with it going away.

So, my second point is really that there’s a third category between pets (hard to copy, hard to deal with loss) and (steak) cattle (easy to copy, easy to deal with loss), and that’s of the milk cattle (easy to copy, but still hard to deal with lose). This last category by its very nature persists and is modified, rather than being destroyed and rebuilt each time. All of those things that we had to think about for when we wanted to change our pets still apply. Maybe it’s not to servers, but the lessons learned are still valuable.

And lastly, not everyone is there. And not everyone who is there is there for everything that they do (there’s probably a mix of services made of cattle and services made of pets in a lot of organizations). So don’t feel bad. Just figure out which one it should be and work to improve.

PS Interesting enough, if we do the combinations of the above, there’s the last class: a service which is hard to copy, but you can deal with failure. I’m not really sure what that looks like, so I’m going to leave it as an exercise for the reader. I’d be curious if anyone comes up with something interesting. Contact me.

13 Mar 2013, 20:38

A tale of two PaaSes

I spend a good amount of time trying to figure out if my operational team can do much to make the general engineering efforts more productive. We’ve followed the usual turns around self-service IaaS and the like, and we’re now exploring the next level of Platforms-as-a-Service. In exploring the options, I’m seeing two large patterns.

On one hand, there are the “middleware centric and injection based” PaaS models. These are the ones where the developer picks a development middleware (Java Servlet, PHP, Node, Rails, etc), and adds other parts in. As if by an after thought, a static file service is added, or maybe a data persistance (i.e. database) service is added. On a implementation level, these usually involve allocating some compute and storage resource (e.g. a VM), installing the middleware container, doing a baseline install of the add-ons, and starting them all up inside that VM. There are some other configuration items such as pointing it at some version control repository, but also the developer is able to login onto the VM via shell.

On another hand, there’s a “service focused” PaaS model. This feels like the lesser named PaaS, though it probably has a larger install because this is the model that AWS largely is. In this case, the developer picks different service components (e.g. DB, cache, messaging bus, etc) and composites them a bit more independently. Underneath the control layer, each of the component providers can implement their services in different ways - using different VMs, processes, or internal containers (e.g. DB schemas w/ authnz) - based on what makes sense for that provider. There’s more work for the developer here as they have to compose services across different providers, and the developer doesn’t have direct access to the underlying system, but in exchange, might have better options.

From an implementor’s perspective, I think the service focused model is easier to maintain. This may not necessarily be the right reason to go down that route, but when it comes to delivery, that matters a lot. It’s also a bit more transportable - at this point in the industry’s lifecycle, it’d be easier to migrate from one IaaS or traditional Infrastructure to another. It’s also easier to extend this model to other (traditional?) services such as monitoring. You can see this in the industry - there’s many different service providers focusing on a narrow niche offering around one specific service, but fewer middleware centric vendors and even those that exist tend to also include some service based model for the add-ons.

As I said, most of the traditional services called PaaS are the form. So, what makes the application middleware so much different than a data store? or a caching layer? Fundamentally, you have some level of “service” which you want to present a clean interface to. This is true for the database as well as for a java servlet container, yet somehow we treat them a little differently in our heads. The only reason I can imagine is that is where time is spent. As a developer, I spend most of my time in the code, so that’s where my mind goes. But while I run it, I want to have a better idea of how it fits in with the other component services.

I think the vote is still out on which way has better long term viability. And it may never be decided on. It may just be a matter of preference.

Maybe PaaS isn’t the right term for this second model. These are more Services-as-a-Service, which seems likely to be a great way to confuse people. Mmaybe they’re more along the lines of Infrastructure, and are just a different take on that. I’ll admit, I’m not sure what the right way to refer to them is, but I believe that the use case that they present is more than just an implementor’s fancy. It’s a valid use case based on how developers are expecting to work with it.