-
Notifications
You must be signed in to change notification settings - Fork 0
Wildish-LBL/condor_annex
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
OVERVIEW We use a CloudFormation template to help track all the moving parts (and clean them up when we're done). The moving parts include: The lease: * A CloudWatch metric, specific to each annex, whose namespace includes the stack's name. * A CloudWatch alarm, specific to each annex, which triggers when the custom metric becomes 0 or misses "enough" updates. * An SNS topic, specific to each annex, which ties to the preceding trigger to the following Lambda function. * A Lambda function, the same for all annexes, triggered by the alarm. The Lambda function checks if the alarm was properly triggered (to work around an AWS bug where the alarm is evaluated on an update before the update is evaluated): if so, it deletes the stack, otherwise it resets the alarm. The instances: * One or more AutoScaling Groups. (Up to eight, given the current limitations of the template engine.) * One LaunchConfiguration per AutoScaling Group, each created in its own substack (the substack is only used as a template engine). * A Security Group... Some plumbing: * A Lambda function, the same for all annexes, invoked as a custom resource, that adjusts the permissions on the previous Lambda function so that the SNS topic can invoke it when the alarm triggers and calls it. * A Lambda function, the same for all annexes, invoked as a custom resource, that inserts a data point into the lease metric. (The alarm will otherwise never leave its initial INSUFFICIENT (data) state.) * Three IAM roles, one for each Lambda function, each the same for all annexes. * An IAM Role and IAM Role Profile to grant instances permission to read from the private S3 bucket we created to share security tokens. LAUNCH CONFIGURATION Amazon does some things that make it very difficult to to automatically generate launch configurations: * Some instance types don't exist in some availability zones. * Some instance types may require VPC. You'd think we could work around this, but we need subnet IDs... and a subnet is availability-zone specific. Of course, we don't know the names of the availability zones -- or even the count -- since they vary from region to region. We could create mappings, but how do we create loops? The subnet resource doesn't allow a list in the availability zone parameter. Could we create another custom resource whose Lambda function creates a subnet in each AZ for the region it runs in? Even if we could, how would we create the launch group(s)? * AMIs vary by availability zone. This can be worked around using mappings -- see Amazon's standard example templates -- but it makes everything harder. In the glorious future, when we have our own AMIs, we can use the Amazon trick to choose one, but it's wildly unclear how to deal with the VPC subnet requirements and launch group variability. [Update: what we do now is list all selected VPC subnets in each launch configuration and let Amazon figure it out. This works as far as it goes, but of course requires VPC. (Do any instance types require EC2 Classic?) This doesn't help with the problem of some instance types not being available in some AZs, but since VPC subnets are tied to AZs, each AutoScaling Group is at least "trying" to launch its instance types in each AZ.] DEPENDENCIES Note that the "DependsOn" attributes exist to force the LeaseFunctionRole to be deleted last, because once it's deleted, CloudFormation no longer has the permissions to delete anything else. FIXME [] We need to look into which users on the machine (current set up has user nobody running jobs) can act as the role associated with the machine, e.g., download the credentials from S3. [] If condor_annex fails to create the stack, it should check once more if the stack already exists (to help catch simultaneous uses). FEATURE REQUESTS * When creating an AWS::SNS::Topic resource with a Lamba-protocol subscription, automatically add permissions on the referenced function to allow that topic to Invoke([.*]?) it. * The Cloud Formation web console should check parameter validity earlier and write out constraint failure messages next to the corresponding parameter. * Add an intrinsic function to refer to arbitrary parameters in other resources. * An AWS::Lambda::Function should be able to specify the duration of its logs. It's really hard to do this right now, because the log won't actually exist until the Lambda function runs, which could happen at any time. This may actually be a feature request for CloudWatch -- I'd like to set the default retention policy for a log group based on its name (and/or source). * Delete generated subscriptions with the rest of the stack. * Access to the ARN of a security group. (Using Fn::Join is awkward.) * Deletion order dependency. In particular, if I create an IAM role for a Lambda function, it has to be the last thing deleted, because otherwise CloudFormation loses the privileges to finish the deletion. However, that IAM role should have its resources restricted based on resources created for the stack... - I can't grant an instance permission to detach only itself, or even any instance only from its own autoscaling group. :( * [2016-02-18] CloudFormation doesn't (yet) allow per-instance type bids when creating a Spot Fleet. * [2016-02-18] It would be nice to be able to create Spot Fleets with an initial size of 0. * [2016-02-18] It would be really nice to be able to specify defaults with functions (Fn::Join, Fn::Ref, and Fn::FindInMap in particular). This would be doubly nice with a function to look up arbitrary values from any resource in the template (see above). * [2016-02-18] foreach() would be nice for use with comma-separated lists. * [2016-02-18] It would be nice if AWS::NoValue disappeared from inside object lists, instead of giving Spot Fleet fits. * [2016-02-18] Some of these issues may be better resolved by allowing me to add (existing) resources to a stack, rather than predeclare them all. * [2016-02-18] Being able to specify a periodic timer (CloudWatch Event) in CloudFormation would be really nice. ? [2016-02-18] Specify propogated user tag(s) for the stack. (Currently not propogated to EBS volumes.) Unclear how or if possible to tag the whole stack. * [2016-03-11] In particular, it looks like a tagged autoScaling Group or Spot Fleet that propogates tags to its instances doesn't tag the Spot instance requests by which it acquires those instances. NOTES * I could possibly use code to create the log group each Lambda function will by convention create, and set the retention policy there? * I would like to narrow the resource permissions granted to the IAM roles, but some actions don't support that, and others cause circular dependency problems -- see above.
About
There are many clouds like it, but this one is (not) mine.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published