FreeIPA Identity Management planet - technical blogs

January 18, 2019

William Brown

Structuring Rust Transactions

Structuring Rust Transactions

I’ve been working on a database-related project in Rust recently, which takes advantage of my concurrently readable data structures. However, I ran into the problem of how to structure read/write transaction types that share the reader code and contain multiple inner read/write guard types.

Some Constraints

To be clear, there are some constraints. A “parent” write will only ever contain write transaction guards, and a read will only ever contain read transaction guards. This means we aren’t going to hit any deadlocks in the code. Rust can’t protect us from mis-ordering locks. An additional requirement is that readers and a single writer must be able to proceed simultaneously - but having an RwLock-style “writer or readers” behaviour would still work here.

Some Background

To simplify this, imagine we have two concurrently readable data structures. We’ll call them db_a and db_b.

struct db_a { ... }

struct db_b { ... }

Now, each of db_a and db_b has its own way to protect its inner content, and each will return a DBReadGuard or DBWriteGuard when we call read() or write() respectively.

impl db_a {
    pub fn read(&self) -> DBReadGuard {
        ...
    }

    pub fn write(&self) -> DBWriteGuard {
        ...
    }
}

Now we make a “parent” wrapper transaction such as:

struct server {
    a: db_a,
    b: db_b,
}

struct server_read {
    a: DBReadGuard,
    b: DBReadGuard,
}

struct server_write {
    a: DBWriteGuard,
    b: DBWriteGuard,
}

impl server {
    pub fn read(&self) -> server_read {
        server_read {
            a:,
            b:,
        }
    }

    pub fn write(&self) -> server_write {
        server_write {
            a: self.a.write(),
            b: self.b.write(),
        }
    }
}

The Problem

Now the problem is that on my server_read and server_write I want to implement a function for “search” that uses the same code. A search on a read or a write should behave identically! I also wanted to avoid the use of macros, as they can hide issues while stepping through a debugger like LLDB/GDB.

Often the answer with Rust is “traits”: creating an interface that types adhere to. Rust also allows default trait implementations, which sounds like it could be a solution here.

pub trait server_read_trait {
    fn search(&self) -> SomeResult {
        let result_a =; // compile error: the trait has no field `a`!
        let result_b =;
        SomeResult(result_a, result_b)
    }
}

In this case, the issue is that &self in a trait is not aware of the fields in the struct - traits don’t define that fields must exist, so the compiler can’t assume they exist at all.

Second, the type of self.a/b is unknown to the trait - because in a read it’s a “a: DBReadGuard”, and for a write it’s “a: DBWriteGuard”.

The first problem can be solved by adding getter functions to the trait. Rust will also inline these calls, so the correct thing for the type system is also the optimal thing at run time. So we’ll update this to:

pub trait server_read_trait {
    fn get_a(&self) -> ???;

    fn get_b(&self) -> ???;

    fn search(&self) -> SomeResult {
        let result_a = self.get_a().search(...); // note the change from self.a to self.get_a()
        let result_b = self.get_b().search(...);
        SomeResult(result_a, result_b)
    }
}

impl server_read_trait for server_read {
    fn get_a(&self) -> &DBReadGuard {
        &self.a
    }
    // get_b is similar, so omitted
}

impl server_read_trait for server_write {
    fn get_a(&self) -> &DBWriteGuard {
        &self.a
    }
    // get_b is similar, so omitted
}

So now we have the second problem remaining: for the server_write we have a DBWriteGuard, and for the read we have a DBReadGuard. There was a much longer experimentation process, but eventually the answer was simpler than I was expecting. Rust allows traits to declare associated types that are themselves bound by traits.

So provided that DBReadGuard and DBWriteGuard both implement “DBReadTrait”, we can give server_read_trait an associated type with that bound. It looks something like:

pub trait DBReadTrait {
    fn search(&self) -> ...;
}

impl DBReadTrait for DBReadGuard {
    fn search(&self) -> ... { ... }
}

impl DBReadTrait for DBWriteGuard {
    fn search(&self) -> ... { ... }
}

pub trait server_read_trait {
    type GuardType: DBReadTrait; // say that GuardType must implement DBReadTrait

    fn get_a(&self) -> &Self::GuardType; // implementors must return that type implementing the trait

    fn get_b(&self) -> &Self::GuardType;

    fn search(&self) -> SomeResult {
        let result_a = self.get_a().search(...);
        let result_b = self.get_b().search(...);
        SomeResult(result_a, result_b)
    }
}

impl server_read_trait for server_read {
    type GuardType = DBReadGuard;

    fn get_a(&self) -> &DBReadGuard {
        &self.a
    }
    // get_b is similar, so omitted
}

impl server_read_trait for server_write {
    type GuardType = DBWriteGuard;

    fn get_a(&self) -> &DBWriteGuard {
        &self.a
    }
    // get_b is similar, so omitted
}

This works! We now have a way to write a single “search” implementation shared by our server read and write types. In my case, the DBReadTrait also uses a similar technique to define a search implementation shared between the DBReadGuard and DBWriteGuard.

January 18, 2019 02:00 PM

SUSE Open Build Service cheat sheet

SUSE Open Build Service cheat sheet

Part of starting at SUSE has meant that I get to learn about Open Build Service. I’ve known that the project existed for a long time but I have never had a chance to use it. So far I’m thoroughly impressed by how it works and the features it offers.

As A Consumer

The best part of OBS is that it’s trivial on OpenSUSE to consume content from it. Zypper can add projects with the command:

zypper ar obs://<project name> <repo nickname>
zypper ar obs://network:ldap network:ldap

I like to make the repo nickname (your choice) the same as the project name, so I know what I have enabled. Once you run this you can easily consume content from OBS.

Package Management

As someone who has started to contribute to the SUSE 389-ds package, I’ve been slowly learning the workflow. Like GitHub/GitLab, OBS allows a branch and request model.

On OpenSUSE you will want to use the osc tool for your workflow:

zypper in osc

You can branch from an existing project to make changes with:

osc branch <project> <package>
osc branch network:ldap 389-ds

This will branch the project to my home namespace. For me this will land in “home:firstyear:branches:network:ldap”. Now I can checkout the content on to my machine to work on it.

osc co <project>
osc co home:firstyear:branches:network:ldap

This will create the folder “home:…:ldap” in the current working directory.

From here you can now work on the project. Some useful commands are:

Add new files to the project (patches, new source tarballs etc).

osc add <path to file>
osc add feature.patch
osc add new-source.tar.xz

Edit the change log of the project (I think this is used in release notes?)

osc vc

Build your changes locally, matching the system you are on. Packages normally build on all/most OpenSUSE versions and architectures; this will build just for your local system and arch.

osc build

Commit your changes to the OBS server, where a complete build will be triggered:

osc commit

View the results of the last commit:

osc results

Enable people to use your branch/project as a repository. You edit the project metadata and enable repo publishing:

osc meta prj -e <name of project>
osc meta prj -e home:firstyear:branches:network:ldap

# When your editor opens, change this section to enabled (disabled by default):
  <enabled />

NOTE: In some cases, if you have the package already installed and you add the repo and update, it won’t install from your repo. This is because SUSE packages have a notion of “vendoring”: they continue to update from the same repo as they were originally installed from. If you want to change this, use:

zypper [d]up --from <repo name>

You can then create a “request” to merge your branch changes back to the project origin. This is:

osc sr

So far this is as far as I have gotten with OBS, but I already appreciate how great this work flow is for package maintainers, reviewers and consumers. It’s a pleasure to work with software this well built.

January 18, 2019 02:00 PM

January 01, 2019

William Brown

Useful USG pro 4 commands and hints

Useful USG pro 4 commands and hints

I’ve recently changed from a FreeBSD vm as my router to a Ubiquiti PRO USG4. It’s a solid device, with many great features, and I’m really impressed at how it “just works” in many cases. So far my only disappointment is the lack of documentation about the CLI, especially for debugging and auditing what is occurring in the system, and for troubleshooting steps. This post will aggregate some of my knowledge about the topic.

Current config

Show the current config with:

mca-ctrl -t dump-cfg

You can show system status with the “show” command. Pressing ? will cause the current completion options to be displayed. For example:

# show <?>
arp              date             dhcpv6-pd        hardware


The following commands show the DNS statistics, the DNS configuration, and allow changing the cache-size. The cache-size is measured in number of records cached, rather than KB/MB. To make this permanent, you need to apply the change to config.json in your controller’s sites folder.

show dns forwarding statistics
show system name-server
set service dns forwarding cache-size 10000


You can see an aggregate of system logs with:

show log

Note that when you set firewall rules to “log on block” they go to dmesg, not syslog, so as a result you need to check dmesg for these.

It’s a great idea to forward your logs in the controller to a syslog server, as this allows you to aggregate and see all the events occurring in a single time series (great when I was diagnosing an issue recently).


To show the system interfaces

show interfaces

To restart your pppoe dhcp6c:

release dhcpv6-pd interface pppoe0
renew dhcpv6-pd interface pppoe0

There is a current issue where the firmware will start dhcp6c on eth2 and pppoe0, but the session on eth2 blocks the pppoe0 client. As a result, you need to release on eth2, then renew on pppoe0.

If you are using a dynamic prefix rather than static, you may need to reset your dhcp6c duid.

delete dhcpv6-pd duid

To restart an interface with the vyatta tools:

disconnect interface pppoe
connect interface pppoe


I have setup customised OpenVPN tunnels. To show these:

show interfaces openvpn detail

These are configured in config.json with:

# Section: config.json - interfaces - openvpn
    "vtun0": {
            "encryption": "aes256",
            # This assigns the interface to the relevant firewall zone.
            "firewall": {
                    "in": {
                            "ipv6-name": "LANv6_IN",
                            "name": "LAN_IN"
                    },
                    "local": {
                            "ipv6-name": "LANv6_LOCAL",
                            "name": "LAN_LOCAL"
                    },
                    "out": {
                            "ipv6-name": "LANv6_OUT",
                            "name": "LAN_OUT"
                    }
            },
            "mode": "server",
            # By default, ubnt adds a number of parameters to the CLI, which
            # you can see with ps | grep openvpn
            "openvpn-option": [
                    # If you are making site to site tunnels, you need the ccd
                    # directory, with hostname for the file name and
                    # definitions such as:
                    # iroute
                    "--client-config-dir /config/auth/openvpn/ccd",
                    "--keepalive 10 60",
                    "--user nobody",
                    "--group nogroup",
                    "--proto udp",
                    "--port 1195"
            ],
            "server": {
                    "push-route": [
                    ],
                    "subnet": ""
            },
            "tls": {
                    "ca-cert-file": "/config/auth/openvpn/vps/vps-ca.crt",
                    "cert-file": "/config/auth/openvpn/vps/vps-server.crt",
                    "dh-file": "/config/auth/openvpn/dh2048.pem",
                    "key-file": "/config/auth/openvpn/vps/vps-server.key"
            }
    }

January 01, 2019 02:00 PM

The idea of CI and Engineering

The idea of CI and Engineering

In software development I see an interesting trend and push towards continuous integration, continual testing, and testing in production. These techniques are designed to allow faster feedback on errors, to use real data for application testing, and to deliver features and changes faster.

But is that really how people use software on devices? When we consider an operation like Google or Amazon, this always-online technique may work, but what happens when we apply a continuous integration and “we’ll patch it later” mindset to devices like phones or the internet of things?

What happens in other disciplines?

In real engineering disciplines like aviation or construction, techniques like this don’t really work. We don’t continually build bridges, then fix them when they break or collapse. There are people who provide formal analysis of materials, their characteristics. Engineers consider careful designs, constraints, loads and situations that may occur. The structure is planned, reviewed and verified mathematically. Procedures and oversight is applied to ensure correct building of the structure. Lessons are learnt from past failures and incidents and are applied into every layer of the design and construction process. Communication between engineers and many other people is critical to the process. Concerns are always addressed and managed.

The first thing to note is that if we just built lots of scale-model bridges and continually broke them until we found their limits, we would waste many resources. Bridges are carefully planned and proven.

So what’s the point with software?

Today we still have a mindset that continually breaking and building is a reasonable path to follow. It’s not! It means that the only way to achieve quality is to have a large test suite (which requires people and time to write), which has to be further derived from failures (and those failures can negatively affect real people), and then we have to apply large amounts of electrical energy to continually run the tests. The test suites can’t even guarantee complete coverage of all situations and occurrences!

This puts CI techniques out of reach of many application developers due to time and energy (translated to dollars) limits. Services like Travis on GitHub certainly help to lower the energy requirement, but they don’t remove the time and test-writing requirements.

No matter how many tests we have for a program, if that program is written in C or something else, we continually see faults and security/stability issues in that software.

What if we CI on … a phone?

Today we even have hardware devices that are approached as though “test in production” is a reasonable thing. It’s not! People don’t patch, telcos don’t allow updates out to users, and those who are aware have to deploy custom ROMs. This creates an odd dichotomy of “haves” and “have-nots”: those with the technical know-how get a better experience, while the “have-nots” have to suffer potentially insecure devices. This is especially terrifying given how deeply personal phones are.

This is a reality of our world. People do not patch. They do not patch phones, laptops, network devices and more. Even enterprises will avoid patching if possible. Rather than trying to shift the entire culture of humans to “update always”, we need to write software that can cope in harsh conditions, for long term. We only need to look to software in aviation to see we can absolutely achieve this!

What should we do?

I believe that for software developers to properly become software engineers we should look to engineers in civil and aviation industries. We need to apply:

  • Regulation and ethics (safety of people is always first)
  • Formal verification
  • Consider all software will run long term (5+ years)
  • Improve team work and collaboration on designs and development

The reality of our world is people are deploying devices (routers, networks, phones, lights, laptops and more …) where they may never be updated or patched in their service life. Even I’m guilty (I have a modem that’s been unpatched for about 6 years but it’s pretty locked down …). As a result we need to rely on proof, at build time, that the device can not fail, rather than patching later - which may never occur! Putting formal verification first, and always considering user safety and rights first, shifts a large burden to us in terms of time. But many tools (Coq, F*, Rust …) make formal verification more accessible to our industry. Verifying our software is a far stronger assertion of quality than “throw tests at it and hope it works”.

Over time our industry will evolve, and it will become easier and more cost effective to formally verify than to operate and deploy CI. This doesn’t mean we don’t need tests - it means that the first line of quality should be in verification of correctness using formal techniques rather than using tests and CI to prove correct behaviour. Tests are certainly still required to assert further behavioural elements of software.


Over time our industry must evolve to put the safety of humans first. To achieve this we must look to other safety-driven cultures such as aviation and civil engineering. Only by learning from their strict disciplines and behaviours can we start to provide software that matches the behavioural and quality expectations humans have.

January 01, 2019 02:00 PM

December 30, 2018

William Brown

Nextcloud and badrequest filesize incorrect

Nextcloud and badrequest filesize incorrect

My friend came to my house and was trying to share some large files with my nextcloud instance. Part way through the upload an error occurred.

"Exception":"Sabre\\DAV\\Exception\\BadRequest","Message":"expected filesize 1768906752 got 1768554496"

It turns out this error can be caused by many sources. It could be timeouts, bad requests, network packet loss, incorrect nextcloud configuration or more.

We tried uploading larger files (by a factor of 10) and they worked. This eliminated timeouts as a cause, and probably network loss. Being on ethernet direct to the server generally also helps to eliminate packet loss as a cause, compared to, say, the internet.

We also knew that the server must not have been misconfigured because a larger file did upload, so no file or resource limits were being hit.

This also indicated that the client was likely doing the right thing because larger and smaller files would upload correctly. The symptom now only affected a single file.

At this point I realised: what if the client and server were both victims of a lower-level issue? I asked my friend to ls the file and read me its length in bytes. It was 1768906752, as nextcloud expected.

Then I asked him to cat that file into a new file, and to tell me the length of the new file. Cat encountered an error, but ls on the new file indeed showed a size of 1768554496. That means filesystem corruption! What could have led to this?


Apple’s legacy filesystem (and the reason I stopped using Macs) is well known for silently eating files and corrupting content. Here we had yet another case of that damage occurring, and triggering errors elsewhere.

Bisecting these issues and eliminating possibilities through a scientific method is always the best way to resolve the cause, and it may come from surprising places!

December 30, 2018 02:00 PM

December 20, 2018

William Brown

Identity ideas …

Identity ideas …

I’ve been meaning to write this post for a long time. Taking half a year away from the 389-ds team, and exploring a lot of ideas from other projects has led me to come up with some really interesting ideas about what we do well, and what we don’t. I feel like this blog could be divisive, as I really think that for our services to stay relevant we need to make changes that really change our own identity - so that we can better represent yours.

So strap in, this is going to be long …

What’s currently on the market

Right now the market for identity has two extremes. At one end we have the legacy “create your own” systems that are built on technologies like LDAP and Kerberos. I’m thinking about things like 389 Directory Server, OpenLDAP, Active Directory, FreeIPA and more. These all happen to be constrained heavily by complexity, fragility, and administrative workload. You need to spend months to learn these, and even still you will make mistakes and there will be problems.

At the other end we have hosted “Identity as a Service” options like Azure AD and Auth0. These have, very intelligently, unbound themselves from legacy, and tend to offer HTTP APIs, 2FA and other features that “just work”. But they are all in the cloud, and outside your control.

But there is nothing in the middle. There is no option that “just works”, supports modern standards, and is unhindered by legacy that you can self deploy with minimal administrative fuss - or years of experience.

What do I like from 389?

  • Replication

The replication system is extremely robust, and has passed many complex tests for cases of eventual consistency correctness. It’s very rare to hear of any kind of data corruption or loss within our replication system, and that’s testament to the great work of people who spent years looking at the topic.

  • Performance

We aren’t as fast as OpenLDAP in a 1-vs-1 server comparison, but our replication scalability is much higher: in any size of MMR or read-only replica topology, we have higher horizontal scaling, nearly linear with server additions. If you want to run a cloud-scale replicated database, we scale to it (and people already do this!).

  • Stability

Our server stability is well known with administrators, and honestly is a huge selling point. We see servers that only go down when administrators are performing upgrades. Our work with sanitising tools and the careful eyes of the team has ensured our code base is reliable and solid. Having extensive tests and amazing dedicated quality engineers also goes a long way.

  • Feature rich

There are a lot of features I really like, and are really useful as an admin deploying this service. Things like memberof (which is actually a group resolution cache when you think about it …), automember, online backup, unique attribute enforcement, dereferencing, and more.

  • The team

We have a wonderful team of really smart people, all of whom are caring and want to advance the state of identity management. Not only do they want to keep up with technical changes and excellence, they are listening to and want to improve our social awareness of identity management.

Pain Points

  • C

Because DS is written in C, it’s risky and difficult to make changes. People constantly make mistakes that introduce unsafety (even myself), and worse. No amount of tooling or intelligence can take away the fact that C is just hard to use, and people need to be perfect (people are not perfect!), and today we have better tools. We cannot spend our time chasing our tails on pointless issues that C creates, when we should be doing better things.

  • Everything about dynamic admin, config, and plugins is hard and can’t scale

Because we need to maintain consistency through operations from start to end, but also allow changing config, plugins and more during the server’s operation, the current locking design just doesn’t scale. It’s also not 100% safe, as the values are changed by atomics rather than managed by transactions. We could use copy-on-write for this, but why? Config should be managed by tools like Ansible, but today our dynamic config and plugins are both a performance overhead and an admin overhead, because we exclude best-practice tools and have to spend a large amount of time maintaining consistent data when we shouldn’t need to. Fewer features means less support overhead for us, and makes it simpler to test and assert quality and correct behaviour.

  • Plugins to address shortfalls, but a bit odd.

We have all these features to address issues, but they all do it … kind of the odd way. Managed Entries creates user private groups on object creation. But the problem is “unix requires a private group” and “ldap schema doesn’t allow a user to be a group and user at the same time”. So the answer is actually to create a new objectClass that lets a user ALSO be its own UPG, not “create an object that links to the user”. (Or have a client generate the group from user attributes, but we shouldn’t shift responsibility to the client.)

Distributed Numeric Assignment is based on the AD rid model, but it’s all about “how can we assign a value to a user that’s unique?”. We already have a way to do this in the UUID, so why not derive the UID/GID from the UUID? This means there is no complex inter-server communication or pooling - just simple, isolated functionality.
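As a sketch of that idea (purely illustrative - this is not 389-ds or FreeIPA code, and the range constant is made up), a derivation could map the low 32 bits of the entry’s UUID into a safe posix id range:

```rust
// Hypothetical sketch: derive a stable UID/GID directly from a user's
// (random) UUID, instead of coordinating numeric pools between servers.
// The range start below is illustrative, not any real product's policy.
fn uid_from_uuid(uuid: u128) -> u32 {
    const RANGE_START: u32 = 65_536; // avoid clashing with low system ids
    let low = uuid as u32; // truncate to the low 32 bits of the UUID
    RANGE_START + (low % (u32::MAX - RANGE_START))
}

fn main() {
    // A fixed example value; in practice the UUID comes from the entry.
    let uuid: u128 = 0x550e8400_e29b_41d4_a716_446655440000;
    let uid = uid_from_uuid(uuid);
    assert!(uid >= 65_536);
    println!("derived uid: {}", uid);
}
```

Because the id is a pure function of the UUID, every server derives the same value with no communication; the tradeoff is that truncating a random UUID to 32 bits can collide, so a real implementation would need a policy for handling that.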

We have lots of features that just are a bit complex, and could have been made simpler, that now we have to support, and can’t change to make them better. If we rolled a new “fixed” version, we would then have to support both because projects like FreeIPA aren’t going to just change over.

  • client tools are controlled by others and complex (sssd, openldap)

Every tool for dealing with ldap is really confusing and arcane. They all have wild (unhelpful) defaults, and generally this scares people off. It took months of work to get a working ldap server in the past. Why? It’s 2018; things need to “just work”. Our tools should “just work”. Why should I need to hand-edit pam? Why do I need to set weird options in sssd.conf? All of this makes the whole experience poor.

We are making client tools that can help (to an extent), but they are really limited to system administration and they aren’t “generic” tools for every possible configuration that exists. So at some point people will still hit a limit where they have to touch ldap commands. A common request is a simple-to-use web portal for password resets, which today only really exists in FreeIPA, and that already limits its application.

  • hard to change legacy

It’s really hard to make code changes because our surface area is so broad and the many use cases means that we risk breakage every time we do. I have even broken customer deployments like this. It’s almost impossible to get away from, and that holds us back because it means we are scared to make changes because we have to support the 1 million existing work flows. To add another is more support risk.

Many deployments use legacy schema elements that hold us back, ranging from the inet types, schema that enforces a first/last name, to schema that won’t express users + groups in a simple way. It’s hard to ask people to just up and migrate their data, and even if we wanted to, ldap allows too much freedom, so we are more likely to break data than migrate it correctly if we tried.

This holds us back from technical changes, and social representation changes. People are more likely to engage with a large migrational change than an incremental change that disturbs their current workflow (i.e. moving from on-prem to cloud, rather than investing in smaller iterative changes to make their local solutions better).

  • ACIs are really complex

389’s access controls are good because they are in the tree and replicated, but bad because the syntax is awful, complex, and full of traps. Even I need to look up how to write them when I have to. This is not good for a project with such deep security concerns, where your ACIs can look correct but actually expose all your data to risk.

  • LDAP as a protocol is like a ’90s drug experience

LDAP may be the lingua franca of authentication, but it’s complex, hard to use and hard to write implementations for. That’s why in open source we have a monoculture of using the openldap client libraries, because no one can work out how to write a standalone library. Layer on top the complexity of the object and naming model, and we have a situation where no one wants to interact with LDAP and rather keeps it at arm’s length.

It’s going to be extremely hard to move forward here, because the community is so fragmented and small, and the working groups dispersed that the idea of LDAPv4 is a dream that no one should pursue, even though it’s desperately needed.

  • TLS

TLS is great. NSS databases and tools are not.


  • GSSAPI and Kerberos

GSSAPI and Kerberos are a piece of legacy that we just can’t escape from. They are almost a curse, and one we need to break away from, as it’s completely unusable (even if what it promises is amazing). We need to do better.

That and SSO allows loads of attacks to proceed, where we actually want isolated token auth with limited access scopes …

What could we offer

  • Web application as a first class consumer.

People want web portals for their clients, and they want to be able to use web applications as the consumer of authentication. The HTTP protocols must be the first class integration point for anything in identity management today. This means using things like OAUTH/OIDC.

  • Systems security as a first class consumer.

Administrators still need to SSH to machines, and people still need their systems to have identities running on them. Having pam/nsswitch modules is a very major requirement, where those modules have to be fast, simple, and work correctly. Users should “imply” a private group, and UID/GID should be derived dynamically from the UUID (or admins can override it).

  • 2FA/u2f/TOTP.

Multi-factor auth is here (not coming, here), and we are behind the game. We already have Apple and MS pushing for webauthn in their devices. We need to be there for these standards to work, and to support the next authentication tool after that.

  • Good RADIUS integration.

RADIUS is not going away, and is important to education providers and business networks, so RADIUS must “just work”. Importantly, this means mschapv2, which is the universal default for all clients, which in turn means NTHash.

However, we can keep the NTHash unlinked from your normal password, so you can have a wifi password and a separate login password. We could even generate an NTHash containing the TOTP token for higher-security environments.

  • better data structure (flat, defined by object types).

The tree structure of LDAP is confusing, but a flatter structure is easier to manage and understand. We can use ideas from kubernetes like tags/labels which can be used to provide certain controls and filtering capabilities for searches and access profiles to apply to.

  • structured logging, with in built performance profiling.

Being able to diagnose why an operation is slow is critical and having structured logs with profiling information is key to allowing admins and developers to resolve performance issues at scale. It’s also critical to have auditing of every single change made in the system, including internal changes that occur during operations.

  • access profiles with auditing capability.

Access profiles that express what you can access, and how. Easier to audit, generate, and should be tightly linked to group membership for real RBAC style capabilities.

  • transactions by allowing batch operations.

LDAP wants to provide a transaction system over a set of operations, but that may cause performance issues on write paths. Instead, why not allow submission of batches of changes that all must occur “at the same time” or “none”. This is faster network wise, protocol wise, and simpler for a server to implement.
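A sketch of what that “all or none” batch semantic could look like server-side (illustrative only - Db, Change and apply_batch are invented names, not a real protocol): stage every change, and commit only if all of them succeed.

```rust
// Illustrative sketch of "all or none" batch semantics: apply every
// change against a staged copy, and commit only on full success.
#[derive(Debug)]
enum Change {
    Add(String),
    Delete(String),
}

struct Db {
    entries: Vec<String>,
}

impl Db {
    // Returns Err without mutating self if any change fails.
    fn apply_batch(&mut self, changes: &[Change]) -> Result<(), String> {
        // Stage against a copy; swap it in only when every change applied.
        let mut staged = self.entries.clone();
        for change in changes {
            match change {
                Change::Add(name) => staged.push(name.clone()),
                Change::Delete(name) => {
                    let pos = staged
                        .iter()
                        .position(|e| e == name)
                        .ok_or_else(|| format!("no such entry: {}", name))?;
                    staged.remove(pos);
                }
            }
        }
        self.entries = staged; // "commit"
        Ok(())
    }
}

fn main() {
    let mut db = Db { entries: vec!["alice".to_string()] };

    // A batch containing a failing delete leaves the db untouched.
    let bad = [Change::Add("bob".into()), Change::Delete("carol".into())];
    assert!(db.apply_batch(&bad).is_err());
    assert_eq!(db.entries, vec!["alice".to_string()]);

    // A fully valid batch commits as one unit.
    let good = [Change::Add("bob".into()), Change::Delete("alice".into())];
    assert!(db.apply_batch(&good).is_ok());
    assert_eq!(db.entries, vec!["bob".to_string()]);
}
```

The client submits the whole batch in one round trip, and the server needs no long-lived transaction state across requests, which is what makes this cheaper than a full LDAP transaction extension.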

What’s next then …

Instead of fixing what we have, why not take the best of what we have, and offer something new in parallel? Start a new front end that speaks in an accessible way, that has modern structures, and has learnt from the lessons of the past? We can build it to stand alone, or to proxy from the robust core of 389 Directory Server, allowing migration paths, but eschewing the pain of trying to bring people to the modern world. We can offer something unique: an open source identity system that’s easy to use, fast, and secure, that you can run on your terms, or in the cloud.

This parallel project seems like a good idea … I wonder what to name it …

December 20, 2018 02:00 PM

December 08, 2018

William Brown

Work around docker exec bug

Work around docker exec bug

There is currently a docker exec bug in Centos/RHEL 7 that causes errors such as:

# docker exec -i -t instance /bin/sh
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused \"read parent: connection reset by peer\""

As a work around you can use nsenter instead:

PID=$(docker inspect --format '{{.State.Pid}}' <name of container>)
nsenter --target $PID --mount --uts --ipc --net --pid /bin/sh

For more information, you can see the bug report here.

December 08, 2018 02:00 PM

November 30, 2018

Fraser Tweedale

Diagnosing Dogtag cloning failures

Diagnosing Dogtag cloning failures

Sometimes, creating a Dogtag clone or a FreeIPA CA replica fails. I have worked with Dogtag and FreeIPA for nearly five years. Over these years I’ve analysed a lot of these clone/replica installation failures and internalised a lot of knowledge about how cloning works, and how it can break. Often when I read a problem report and inspect the logs I quickly get a “gut feeling” about the cause. The purpose of this post is to externalise my internal intuition so that others can benefit. Whether you are an engineer or not, this post explains what you can do to get to the bottom of Dogtag cloning failures faster.

How Dogtag clones are created

Some notes about terminology: in FreeIPA we talk about replicas, but in Dogtag we say clones. These terms mean the same thing. When you create a FreeIPA CA replica, FreeIPA creates a clone of the Dogtag CA instance behind the scenes. I will use the term master to refer to the server from which the clone/replica is being created.

The pkispawn(8) program, depending on its configuration, can be used to create a new Dogtag subsystem or a clone. pkispawn, a Python program, manages the whole clone creation process, with the possible exception of setting up the LDAP database and replication. But some stages of the configuration are handled by the Dogtag server itself (thus implemented in Java). Furthermore, the Dogtag server on the master must service some requests to allow the new clone to integrate into the topology.

The high level procedure of CA cloning is roughly:

  1. (ipa-replica-install) Create temporary Dogtag admin user account and add to relevant groups
  2. (ipa-replica-install or pkispawn) Establish LDAP replication of the Dogtag database
  3. (pkispawn) Extract private keys and system certificates into Dogtag’s NSSDB
  4. (pkispawn) Lay out the Dogtag instance on the filesystem
  5. (pkispawn) Start the pki-tomcatd instance
  6. (pkispawn) Send a configuration request to the new Dogtag instance
    1. (pki-tomcatd on clone) Send security domain login request to master (using temporary admin user credentials)
    2. (pki-tomcatd on master) Authenticate user, return cookie.
    3. (pki-tomcatd on clone) Send number range requests to master
    4. (pki-tomcatd on master) Service number range requests for clone
  7. (ipa-replica-install) Remove temporary admin user account

There are several places a problem could occur: in pkispawn, pki-tomcatd on the clone, or pki-tomcatd on the master. Therefore, depending on what failed, the best data about the failure could be in pkispawn output/logs, the Dogtag debug log on the replica, or the master, or even the system journal on either of the hosts. Recommendation: when analysing Dogtag cloning or FreeIPA CA replica installation failures, inspect all of these logs. It is often not obvious where the error is occurring, or what caused it. Having all these log files helps a lot.
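
Gathering all of these sources up front saves a lot of back-and-forth. A minimal collection script might look like the following; the paths are typical defaults on Fedora/RHEL-style installs (assumptions — adjust instance names for your deployment, and run it on both master and replica):

```shell
# Collect the logs relevant to a cloning failure from one host into one place.
# All paths below are typical defaults (assumptions), not guaranteed.
OUT=./clone-debug-logs
mkdir -p "$OUT"
for f in \
    /var/log/ipareplica-install.log \
    /var/log/pki/pki-ca-spawn.*.log \
    /var/log/pki/pki-tomcat/ca/debug
do
    # Copy only what actually exists on this host
    [ -e "$f" ] && cp -r "$f" "$OUT"/
done
# The system journal for the Dogtag unit often holds the Java backtraces:
if command -v journalctl >/dev/null 2>&1; then
    journalctl -u pki-tomcatd@pki-tomcat > "$OUT/journal.txt" 2>/dev/null || true
fi
ls "$OUT"
```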

Case studies

Failure to set up replication

Description: ipa-replica-install or pkispawn fail with errors related to replication (failure to establish). I don’t know how common this is in production environments. I’ve encountered it in my development environments. I think it is usually caused by stale replication agreements or something of that nature.

Workaround: A “folk remedy”: uninstall and clean up the instance, then try again. Most often the error does not recur.

Replication races

Description: pkispawn fails; the replica debug log indicates a security domain login failure; the master debug log indicates the user is unknown, or that the token/session is unknown.

During cloning, the clone adds LDAP objects in its own database. It then performs requests against the master, assuming that those objects (or effects of other LDAP operations) have been replicated to the master. Due to replication lag, the data have not been replicated and as a consequence, a request fails.

In the past couple of years several replication races were discovered and fixed (or mitigated) in Dogtag or FreeIPA:

updateNumberRange failure due to missing session object


Description: After security domain login (locally on the replica) the session object gets replicated to the master. The cookie/token conveyed in the updateNumberRange request referred to a session that the master did not yet know about.

Resolution: the replica sleeps (duration configurable; default 5s) after security domain login, giving time for replication. This is not guaranteed to avoid the problem: the complete solution (yet to be implemented) will be to use a signed/MACed token.

Security domain login failure due to missing user or group membership


Description: This bug was actually in FreeIPA, but manifested in pki-tomcatd on master as a failure to log into the security domain. This could occur for one of two reasons: either the user was unknown, or the user was not a member of a required group. FreeIPA performs the relevant LDAP operations on the replica, but they have not replicated to master yet. The pkispawn/ipa-replica-install error message looks something like:

com.netscape.certsrv.base.PKIException: Failed to obtain
installation token from security domain:
com.netscape.certsrv.base.UnauthorizedException: User
admin-replica1.ipa.example is not a member of Enterprise CA
Administrators group.

Workaround: no supported workaround. (You could hack in a sleep though).

Resolution: The user creation routine was already waiting for replication but the wait routine had a timeout bug causing false positives, and the group memberships were not being waited on. The timeout bug was fixed. The wait routine was enhanced to support waiting for particular attribute values; this feature was used to ensure group memberships have been replicated before continuing.

Other updateNumberRange failures


Description: When creating a clone from a master that was itself a clone, an updateNumberRange request fails at master with status 500. A NullPointerException backtrace appears in the journal for the pki-tomcatd@pki-tomcat unit (on master). The problem arises because the initial number range assignment for the second clone is equal to the range size of the first clone (range transfer size is a fixed number). This scenario was not handled correctly, leading to the exception.
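
The arithmetic behind this scenario can be illustrated with toy numbers (the values here are hypothetical, not the actual Dogtag range sizes):

```shell
# Toy illustration of the depleted-range scenario: the master (itself a
# clone) holds exactly one transfer-sized range, so handing a full
# transfer-sized chunk to the new clone would leave it nothing.
range_transfer_size=10000
master_range_size=$range_transfer_size     # master was itself cloned once
left_over=$((master_range_size - range_transfer_size))
echo "master retains $left_over numbers"   # prints: master retains 0 numbers
```

It is this zero-remainder edge case that was not handled correctly, producing the NullPointerException on the master.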

Workaround: Ensure that each clone services one of each kind of number (e.g. one full certificate request and issuance operation). This ensures that the clone’s range is smaller than the range transfer size, so that a subsequent updateNumberRange request will be satisfied from the master’s “standby” range.

Resolution: detect range depletion due to updateNumberRange requests and eagerly switch to the standby range. A better fix (yet to be implemented) will be to allocate each clone a full-sized range from the unallocated numbers.


Dogtag subsystem cloning is a complex procedure. Even more so in the FreeIPA context. There are lots of places failure can occur.

The case studies above are a few examples of difficult-to-debug failures where the cause was non-obvious. Often the error occurs on a different host (the master) from where the error was observed. And the important data about the true cause may reside in ipareplica-install.log, pkispawn log output, the Dogtag CA debug log (on replica or master) or the system journal (again on replica or master). Sometimes the 389DS logs can be helpful too.

Normally the fastest way to understand a problem is to gather all these sources of data and look at them all around the time the error occurred. When you see one failure, don’t assume that that is the failure. Cross-reference the log files. If you can’t see anything about an error, you probably need to look in a different file…

…or a different part of the file! It is important to note that Dogtag time stamps are in local time, whereas most other logs are UTC. Different machines in the topology can be in different timezones, so you could be dealing with up to three timezones across the log files. Check carefully what timezone the timestamps are in when you are “lining up” the logfiles. Many times I have seen (and often erred myself) an incorrect conclusion that “there is no error in the debug log” because of this trap.
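
One low-tech way to line the files up is to normalise every timestamp to UTC before comparing. Assuming GNU date is available, a log line stamped in a local zone (here UTC+10, as an example) can be converted like so:

```shell
# Convert a local-time log timestamp (here assumed to be UTC+10) to UTC
# so it can be compared directly with UTC-stamped logs. Requires GNU date.
date -u -d '2018-11-30 14:23:05 +1000' '+%Y-%m-%d %H:%M:%S UTC'
# prints: 2018-11-30 04:23:05 UTC
```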

In my experience, the most common causes of Dogtag cloning failure have involved Security Domain authentication issues and number range management. Over time I and others have fixed several bugs in these areas, but I am not confident that all potential problems have been fixed. The good news is that checking all the relevant logs usually leads to a good theory about the root cause.

What if you are not an engineer or not able to make sense of the Dogtag codebase? (This is fine by the way—Dogtag is a huge, gnarly beast!) The best thing you can do to help us analyse and resolve the issue is to collect all the logs (from the master and replica) and prune them to the relevant timeframe (minding the timezones) before passing them on for an engineer to analyse.

In this post I only looked at Dogtag cloning failures. I have lots of other Dogtag “gut knowledge” that I plan to get out in upcoming posts.

November 30, 2018 12:00 AM

November 27, 2018

Rob Crittenden

Setting up a Mac (OSX) as an IPA client

I periodically see people trying to setup a Mac running OSX as an IPA client. I don’t have one myself so can’t really provide assistance.

There is this guide, which seems to be pretty thorough.

This upstream ticket also has some information on setting up a client, though it isn’t always directly related to simply configuring a client.

So I record this here so I know where to look later 🙂

by rcritten at November 27, 2018 10:10 PM

November 26, 2018

Rob Crittenden

How do I get a certificate for my web site with IPA?

That’s a bit of a loaded question that begs additional questions:

  1. Is the web server enrolled as an IPA client?
  2. What format does the private key and certificate need to be in? (OpenSSL-style PEM, NSS, other?)

If the answer to question 1 is YES then you can do this on that client machine (to be clear, this first step can be done anywhere or in the UI):

$ kinit admin
$ ipa service-add HTTP/

You can use certmonger to request and manage the certificate which includes renewals (bonus!).

If you are using NSS with, let’s say, mod_nss, you’d do something like the following after creating the database and/or putting the password into /etc/httpd/alias/pwdfile.txt:

# ipa-getcert request -K HTTP/ -d /etc/httpd/alias -n MyWebCert -p /etc/httpd/alias/pwdfile.txt -D

Let’s break down what these options mean:

  • -K is the Kerberos principal to store the certificate with. You do NOT need to get a keytab for this service
  • -d the NSS database directory. You can use whatever you want but be sure the service has read access and SELinux permission access to it.
  • -n the NSS nickname. This is just a shortcut name to your cert, use what you want.
  • -p the path to the pin/password for the NSS database
  • -D creates a DNS SAN for the hostname. This is current best practice.

You’ll also need to add the IPA cert chain to the NSS database using certutil.

If you are using OpenSSL with, say, mod_ssl, you’d do something like:

# ipa-getcert request -K HTTP/ -k /etc/pki/tls/private/httpd.key -f /etc/pki/tls/certs/httpd.pem -D

Similar options as above, but with -k and -f instead of -d and -n:

  • -k path to the key file
  • -f path to the certificate file

To check on the status of your new request you can run:

# ipa-getcert list -i <numeric id that was spit out before>

It should be in status MONITORING.

If the answer to #1 is NO then you have two options: use certmonger on a different machine to generate the key and certificate and transfer them to the target or generate a CSR manually.

For the first case, using certmonger on a different machine, the steps are similar to the YES case.

Create a host and service for the web server:

$ kinit admin
$ ipa host-add
$ ipa service-add HTTP/

Now we need to grant the rights to the current machine to get certificates for the HTTP service of

$ ipa service-add-host --hosts <your current machine FQDN> HTTP/

Now run the appropriate ipa-getcert command above to match the format you need and check the status in a similar way.

Once it’s done you need to transfer the cert and key to the webserver.

Finally, if you want to get certificates on an un-enrolled system the basic steps are:

  1. Create a host entry and service as above
  2. Generate a CSR, see (or the next section)
  3. Submit that CSR per the above docs

If your webserver is not registered in DNS then you can use the --force option to host-add and service-add to force their creation.

This should pretty generically apply to all versions of IPA v4+, and probably to v3 as well.


by rcritten at November 26, 2018 09:48 PM

November 20, 2018

Fraser Tweedale

FreeIPA CA renewal master explained

FreeIPA CA renewal master explained

Every FreeIPA deployment has a critical setting called the CA renewal master. In this post I explain how this setting is used, why it is important, and the consequences of improper configuration. I’ll also discuss scenarios which cause the value to change, and why and how you would change it manually.

What is the CA renewal master?

The CA renewal master configuration controls which CA replica is responsible for renewing certain important certificates used within a FreeIPA deployment. I will call these system certificates.

Unlike service certificates (e.g. for HTTP and LDAP) which have different keypairs and subject names on different servers, FreeIPA system certificates, and their keys, are shared by all CA replicas. These include the IPA CA certificate, OCSP certificate, Dogtag subsystem certificates, Dogtag audit signing certificate, IPA RA agent certificate and KRA transport and storage certificates.

The current CA renewal master configuration can be viewed via ipa config-show:

[f28-1] ftweedal% ipa config-show | grep 'CA renewal master'
  IPA CA renewal master: f28-1.ipa.local

Under the hood, this configuration is a server role attribute. The CA renewal master is indicated by the presence of an (ipaConfigString=caRenewalMaster) attribute value on an IPA server’s CA role object. You can determine the renewal master via a plain LDAP search:

[f28-1] ftweedal% ldapsearch -LLL \
      -D "cn=Directory Manager" -W \
      -b "cn=masters,cn=ipa,cn=etc,dc=ipa,dc=local" \
      "(ipaConfigString=caRenewalMaster)"
dn: cn=CA,cn=f28-1.ipa.local,cn=masters,cn=ipa,cn=etc,dc=ipa,dc=local
objectClass: nsContainer
objectClass: ipaConfigObject
objectClass: top
cn: CA
ipaConfigString: startOrder 50
ipaConfigString: caRenewalMaster
ipaConfigString: enabledService

The configuration is automatically set to the first master in the topology on which the CA role was installed. Unless you installed without a CA, this is the original master set up via ipa-server-install.

What problem is solved by having a CA renewal master?

All CA replicas have tracking requests for all system certificates. But if all CA replicas renewed system certificates independently, they would end up with different certificates. This is especially a problem for the CA certificate, and the subsystem and IPA RA certificates which get stored in LDAP for authentication purposes. The certificates must match exactly, otherwise there will be authentication failures between the FreeIPA framework and Dogtag, and between Dogtag and LDAP.

Appointing one CA replica as the renewal master allows the system certificates to be renewed exactly once, when required.

How do other replicas acquire the updated certificates?

The Certmonger tracking requests on all CA replicas use the dogtag-ipa-ca-renew-agent renewal helper. This program reads the CA renewal master configuration. If the current host is the renewal master, it performs the renewal, and stores the certificate in LDAP under cn=<nickname>,cn=ca_renewal,cn=ipa,cn=etc,{basedn}. Additionally, if the certificate is the IPA RA or the Dogtag CA subsystem certificate, the new certificate gets added to the userCertificate attribute of the corresponding LDAP user entry.

If the renewal master is a different host, the latest certificate is retrieved from the ca_renewal LDAP entry and returned to Certmonger. Due to non-determinism in exactly when Certmonger renewal attempts will occur, the non-renewal helper could attempt to “renew” the certificate before the renewal master has actually renewed the certificate. So it is not an error for the renewal helper to return the old (soon to expire) certificate. Certmonger will keep attempting to renew the certificate (with some delay between attempts) until it can retrieve the updated certificate (which will not expire soon).
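
The helper’s branch can be sketched roughly as follows (simplified pseudologic, not the actual dogtag-ipa-ca-renew-agent source; the hostnames are examples from my test setup):

```shell
# Simplified sketch of the renewal helper's decision (not the real
# implementation; hostnames are examples).
renewal_master="f28-1.ipa.local"
this_host="f28-1.ipa.local"

if [ "$this_host" = "$renewal_master" ]; then
    # Renew for real, then publish the result under
    # cn=<nickname>,cn=ca_renewal,cn=ipa,cn=etc,{basedn}
    echo "renew and store in LDAP"
else
    # Fetch whatever the renewal master last published; returning the old
    # (soon-to-expire) cert is not an error -- certmonger retries later.
    echo "fetch latest certificate from LDAP"
fi
```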

What can go wrong?

If it wasn’t clear already, a (CA-ful) FreeIPA deployment must at all times have exactly one CA replica configured as the renewal master. That server must be online, operating normally, and replicating properly with other servers. Let’s look at what happens if these conditions are not met.

If the CA renewal master configuration refers to a server that has been decommissioned, or is offline, then no server will actually renew the certificates. All the non-renewal-master servers will happily reinstall the current certificates until they expire, and things will break. The troublesome thing about certificates is that even one expired certificate can cause renewal failures for other certificates. The problems cascade and eventually the whole deployment is busted.

FreeIPA has a simple protection in place to ensure the renewal master configuration stays valid. Servers can be deleted from the topology via the ipa server-del, ipa-replica-manage del, ipa-csreplica-manage del or ipa-server-install --uninstall command. In these commands, if the server being deleted is the current CA renewal master, a different CA replica is elected as the new CA renewal master.

These protections only go so far. If the renewal master is still part of the topology but is offline for an extended duration it may miss a renewal window, causing expired certificates. If there are replication problems between the renewal master and other CA replicas, renewal might succeed, but the other CA replicas might not be able to retrieve the updated certificates before they expire. All of these problems (and more) have been seen in the wild.

I have seen cases where a CA renewal master was simply decommissioned without formally removing it from the FreeIPA topology. I have also seen cases where there was no CA renewal master configured (I do not know how this situation arose). Both of these scenarios have similar consequences to the “offline for extended duration” scenario.

What would happen if you had two (or more) CA replicas with (ipaConfigString=caRenewalMaster)? I haven’t seen this one in the wild, but I would not be surprised if one day I did. In this case, multiple CA replicas will perform renewals, clobbering each other’s certificates and leaving some replicas with RA agent or Dogtag subsystem certificates out of sync with the corresponding user entries in LDAP. This is a less catastrophic consequence than the aforementioned scenarios, but still serious: it will result in Dogtag or IPA RA authentication failures on some (or most) CA replicas.

Why and how to change the CA renewal master

Why would you need to change the renewal master configuration? Assuming the existing configuration is valid, the main reason you would need to change it is in anticipation of the decommissioning of the existing CA renewal master. You may wish to appoint a particular server as the new renewal master. As discussed above, the commands that remove servers from the topology will do this automatically, but which server will be chosen is out of your hands. So you can get one step ahead and change the renewal master yourself.

In my test setup there are two CA replicas:

[f28-1] ftweedal% ipa server-role-find --role 'CA server'
2 server roles matched
  Server name: f28-0.ipa.local
  Role name: CA server
  Role status: enabled

  Server name: f28-1.ipa.local
  Role name: CA server
  Role status: enabled
Number of entries returned 2

The current renewal master is f28-1.ipa.local:

[f28-1] ftweedal% ipa config-show | grep 'CA renewal master'
  IPA CA renewal master: f28-1.ipa.local

The preferred way to change the renewal master configuration is via the ipa config-mod command:

[f28-1] ftweedal% ipa config-mod \
      --ca-renewal-master-server f28-0.ipa.local \
      | grep 'CA renewal master'
  IPA CA renewal master: f28-0.ipa.local

You can also use the ipa-csreplica-manage command. This requires the Directory Manager passphrase:

[f28-1] ftweedal% ipa-csreplica-manage \
                    set-renewal-master f28-1.ipa.local
Directory Manager password: XXXXXXXX

f28-1.ipa.local is now the renewal master

If for whatever reason the current renewal master configuration is invalid, you can use these same commands to reset it. As a last resort, you can modify the LDAP objects directly to ensure that exactly one CA role object has (ipaConfigString=caRenewalMaster). Note that both the attribute name (ipaConfigString) and value (caRenewalMaster) are case-insensitive.
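
A last-resort direct modification could look like the following LDIF (the server name and suffix here are from my test setup — substitute your own, and be sure to remove the value from any other CA role object first):

```ldif
dn: cn=CA,cn=f28-0.ipa.local,cn=masters,cn=ipa,cn=etc,dc=ipa,dc=local
changetype: modify
add: ipaConfigString
ipaConfigString: caRenewalMaster
```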

Finally, let’s observe what happens when we remove a server from the topology. I’ll remove f28-1.ipa.local (the current renewal master) using the ipa-server-install --uninstall command. After this operation, the CA renewal master configuration should point to f28-0.ipa.local (the only other CA replica in the topology).

[f28-1:~] ftweedal% sudo ipa-server-install --uninstall

This is a NON REVERSIBLE operation and will delete all data
and configuration!
It is highly recommended to take a backup of existing data
and configuration using ipa-backup utility before proceeding.

Are you sure you want to continue with the uninstall procedure? [no]: yes
Forcing removal of f28-1.ipa.local
Failed to cleanup f28-1.ipa.local DNS entries: DNS is not configured
You may need to manually remove them from the tree
Deleted IPA server "f28-1.ipa.local"
Shutting down all IPA services
Unconfiguring CA
... (snip!)
Client uninstall complete.
The ipa-client-install command was successful
The ipa-server-install command was successful

Jumping across to f28-0.ipa.local, I confirm that f28-0.ipa.local has become the renewal master:

[f28-0] ftweedal% ipa config-show |grep 'CA renewal master'
  IPA CA renewal master: f28-0.ipa.local

Explicit CA certificate renewal

There is one more scenario that can cause the CA renewal master to be changed. When the IPA CA certificate is explicitly renewed via the ipa-cacert-manage renew command, the server on which the operation is performed becomes the CA renewal master. This ensures that the replica that was previously the renewal master retrieves the new CA certificate from LDAP instead of renewing it itself.


In this post I explained what the CA renewal master configuration is for and what it looks like under the hood. For FreeIPA/Dogtag system certificates, the CA renewal master configuration controls which CA replica actually performs renewal. The CA renewal master stores the renewed certificates in LDAP, and all other CA replicas look for them there. The dogtag-ipa-ca-renew-agent Certmonger renewal helper implements both of these behaviours, using the CA renewal master configuration to decide which behaviour to execute.

There must be exactly one CA renewal master in a topology and it must be operational. I discussed the consequences of various configuration or operational problems. I also explained why you might want to change the CA renewal master, and how to do it.

The CA renewal master is a critical configuration, and an incorrect renewal master configuration is often a factor in complex customer cases involving FreeIPA’s PKI. Commands that remove servers from the topology should elect a new CA renewal master when necessary. But misconfigurations do arise (if only we could know all the ways how!).

The upcoming FreeIPA Healthcheck feature will, among other checks, confirm that the CA renewal master configuration is sane. It will not (in the beginning at least) be able to diagnose availability or connectivity issues. But it should be able to catch some misconfigurations before they lead to catastrophic failure of the deployment.

November 20, 2018 12:00 AM

October 31, 2018

William Brown

High Available RADVD on Linux

High Available RADVD on Linux

Recently I was experimenting again with high-availability router configurations, so that in the case of an outage or a failover the other router takes over and traffic is still served.

This is usually done through protocols like VRRP, which allow virtual IPs that can be failed over between routers. With IPv6, however, clients still need to be able to find the router, and in the case of a failure the router advertisements must continue for client renewals.

To achieve this we need two parts. A shared Link Local address, and a special RADVD configuration.

Because of how IPv6 routers work, all traffic (even global) is still sent via your link-local router. We can use an address like:

fe80::1:1

This doesn’t clash with any reserved or special IPv6 addresses, and it’s easy to remember. Because of how link-local addressing works, we can put this on many interfaces of the router (many VLANs) with no conflict.

So now to the two components.

Keepalived

Keepalived is a VRRP implementation for linux. It has extensive documentation and sometimes uses some implementation specific language, but it works well for what it does.

Our configuration looks like:

#  /etc/keepalived/keepalived.conf
global_defs {
  vrrp_version 3
}

vrrp_sync_group G1 {
 group {
   ipv6_ens256
 }
}

vrrp_instance ipv6_ens256 {
   interface ens256
   virtual_router_id 62
   priority 50
   advert_int 1.0
   virtual_ipaddress {
      2001:db8::1    # global address (example from the advertised prefix)
      fe80::1:1      # shared link-local address
   }
   garp_master_delay 1
}

Note that we provide both a global address and an LL address for the failover. The global address is important for services and DNS on the router, but you could omit it. The LL address, however, is critical to this configuration and must be present.

Now you can start up vrrp, and you should see one of your two linux machines pick up the address.

RADVD

For RADVD to work, a feature of the 2.x series is required. Packaging this for EL7 is out of scope of this post, but Fedora ships the required version.

The feature is that RADVD can be configured to specify which address it advertises for the router, rather than assuming the interface LL autoconf address is the address to advertise. The configuration appears as:

# /etc/radvd.conf
interface ens256
{
    AdvSendAdvert on;
    MinRtrAdvInterval 30;
    MaxRtrAdvInterval 100;
    AdvRASrcAddress {
        fe80::1:1;
    };
    prefix 2001:db8::/64
    {
        AdvOnLink on;
        AdvAutonomous on;
        AdvRouterAddr off;
    };
};

Note the AdvRASrcAddress parameter? This defines a priority list of addresses to advertise that could be available on the interface.

Now start up radvd on your two routers, and try failing over between them while you ping from your client. Remember, to ping a link-local address from a client you need something like:

ping6 fe80::1:1%en1

Where the outgoing interface of your client traffic is denoted after the ‘%’.

Happy failover routing!

October 31, 2018 02:00 PM

Powered by Planet