
2024

Use Databricks profiles to emulate service principals

TL;DR: Edit your ~/.databrickscfg file to create a profile for each of your service principals.

Problem: The feedback loop when developing a CI/CD pipeline is too slow

I have a CI/CD pipeline that interacts with a Databricks workspace through the Databricks CLI. I usually develop the pipeline locally, testing it against a sandbox Databricks workspace, authenticated as myself.

But when I deploy the pipeline to the CI/CD environment, it runs as a service principal, first against a dev workspace, then against a prod workspace.

There can be some issues that only appear when running as a service principal, such as permission errors or workspace configuration differences. And the feedback loop is too slow: I have to commit, push, wait for the pipeline to run, check the logs, and repeat.

I want to test the pipeline locally, authenticated as a service principal, to catch these issues earlier.

Solution: Use Databricks profiles to emulate service principals

Reading about the one million ways to authenticate to an Azure Databricks workspace is enough to give me a headache (seriously, there are too many options). I have previously used environment variables to authenticate as a service principal, keeping the various secrets in an .env file and commenting and un-commenting them as needed. It is a mess, and I'm guaranteed to forget to switch back to my user account at some point.

Instead, I can use Databricks profiles to store the different authentication configurations. In ~/.databrickscfg, I can create a profile for each service principal and switch between them with the --profile flag.

Here is an example of a ~/.databrickscfg file with two service principal profiles:

.databrickscfg
[DEFAULT]
host  = <SOME_HOST>
token = <SOME_TOKEN>

[project-prod-sp]
host                = <SOME_HOST>
azure_client_id     = <SP_CLIENT_ID>
azure_client_secret = <SP_CLIENT_SECRET>
azure_tenant_id     = <TENANT_ID>

[project-dev-sp]
<same setup as above>

Of course, you should replace the placeholders with the actual values.

To check which workspace and user a profile is using, run the following command:

databricks auth describe --profile project-prod-sp

This will also show you where the authentication is coming from (because, as I mentioned above, there are too many ways to authenticate).

Finally, you can run your pipeline locally, using the --profile flag to specify that you want to use the service principal profile:

databricks bundle deploy --profile project-dev-sp

Alternative to the --profile flag

If you still want to use environment variables, you can set the DATABRICKS_CONFIG_PROFILE variable to the profile name you want to use, e.g.:

DATABRICKS_CONFIG_PROFILE=DEFAULT
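
This also works inline for a one-off command, for example deploying a bundle with the dev profile from above:

DATABRICKS_CONFIG_PROFILE=project-dev-sp databricks bundle deploy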

Test Kafka clients with Docker

TL;DR: Use pytest-docker to create a test fixture that starts a Kafka container.

Problem: I want to test my Kafka client, but I don't have a Kafka cluster

At work, we need to consume and produce messages to some queue, and one of the tools already available is Kafka.

Before integrating with the existing Kafka cluster, I want to test my client code. I want to ensure that it can consume and produce messages correctly.

I have an existing BaseQueueService class like this:

from abc import ABC, abstractmethod


class BaseQueueService(ABC):
    @abstractmethod
    def publish(self, message: str) -> None:
        pass

    @abstractmethod
    def consume(self) -> str | None:
        pass

with existing implementations for Azure Service Bus and an InMemoryQueue for testing business logic.
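
For reference, here is a minimal sketch of what such an InMemoryQueue could look like (an illustration of the interface, not necessarily the exact implementation):

from collections import deque


class InMemoryQueue(BaseQueueService):
    """Minimal in-memory FIFO, handy for unit-testing business logic."""

    def __init__(self) -> None:
        self._messages: deque[str] = deque()

    def publish(self, message: str) -> None:
        self._messages.append(message)

    def consume(self) -> str | None:
        # Return None when empty, mirroring the Kafka implementation below
        return self._messages.popleft() if self._messages else None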

So I want to create a KafkaQueueService class that implements this interface. And I want to test it, but I don't have a Kafka cluster available.

Solution: Use Docker to start a Kafka container for testing

I can use pytest-docker to create a test fixture that starts a Kafka container. This way, I can test my KafkaQueueService class without needing a Kafka cluster.

This is how I did it:

A docker-compose.yml file to start a Kafka container:

docker-compose.yml
services:
  zookeeper:
    image: 'confluentinc/cp-zookeeper:latest'
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    ports:
      - "2181:2181"

  kafka:
    image: 'confluentinc/cp-kafka:latest'
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
    ports:
      - "9092:9092"
    expose:
      - "29092"

  init-kafka:
    image: 'confluentinc/cp-kafka:latest'
    depends_on:
      - kafka
    entrypoint: [ '/bin/sh', '-c' ]
    command: |
      "
      # blocks until kafka is reachable
      kafka-topics --bootstrap-server kafka:29092 --list

      echo -e 'Creating kafka topics'
      kafka-topics --bootstrap-server kafka:29092 --create --if-not-exists --topic testtopic --replication-factor 1 --partitions 1
      kafka-topics --bootstrap-server kafka:29092 --create --if-not-exists --topic input_test_topic --replication-factor 1 --partitions 1
      kafka-topics --bootstrap-server kafka:29092 --create --if-not-exists --topic output_test_topic --replication-factor 1 --partitions 1

      echo -e 'Successfully created the following topics:'
      kafka-topics --bootstrap-server kafka:29092 --list
      "

A conftest.py file to create a test fixture that starts the Kafka container:

import pytest


def check_kafka_ready(required_topics, host="localhost", port=9092):
    from confluent_kafka import KafkaException
    from confluent_kafka.admin import AdminClient

    try:
        admin = AdminClient({"bootstrap.servers": f"{host}:{port}"})
        metadata = admin.list_topics(timeout=5)
        # Ready once all required topics have been created
        return all(topic in metadata.topics for topic in required_topics)
    except KafkaException:
        return False


@pytest.fixture(scope="session")
def kafka_url(docker_services):
    """Start kafka service and return the url."""
    port = docker_services.port_for("kafka", 9092)
    required_topics = ["testtopic", "input_test_topic", "output_test_topic"]
    docker_services.wait_until_responsive(
        check=lambda: check_kafka_ready(port=port, required_topics=required_topics),
        timeout=30.0,
        pause=0.1,
    )
    return f"localhost:{port}"

And finally, a test file to test the KafkaQueueService class:

import uuid

import pytest

# (plus imports for KafkaQueueService and clear_messages_from_queue from your project)


@pytest.mark.kafka
def test_kafka_queue_can_publish_and_consume(kafka_url):
    kafka_queue_service = KafkaQueueService(
        broker=kafka_url,
        topic="testtopic",
        group_id="testgroup",
    )
    clear_messages_from_queue(kafka_queue_service)

    unique_message = "hello" + str(uuid.uuid4())
    kafka_queue_service.publish(unique_message)

    received_message = kafka_queue_service.consume()
    assert received_message == unique_message
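
The clear_messages_from_queue helper isn't shown above; a minimal sketch of what it could look like, given the BaseQueueService interface, is simply to drain the queue:

def clear_messages_from_queue(queue_service: BaseQueueService) -> None:
    # Drain leftover messages so the test starts from a clean slate;
    # consume() returns None once the topic is empty
    while queue_service.consume() is not None:
        pass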

Now I can test my KafkaQueueService class without needing a Kafka cluster. This even works in my CI/CD pipeline in Azure DevOps.

NOTE: The docker_services fixture starts ALL the services in the docker-compose.yml file.
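
Related tip: since the test uses a custom kafka marker, you can register it in conftest.py so pytest doesn't warn about an unknown mark:

def pytest_configure(config):
    # Register the custom marker used by the Kafka integration tests
    config.addinivalue_line("markers", "kafka: tests that require the Kafka container")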

Bonus: The passing implementation of KafkaQueueService

This passes the test above (and a few other tests I wrote):

import logging

from confluent_kafka import Consumer, KafkaError, Producer

logger = logging.getLogger(__name__)


class KafkaQueueService(BaseQueueService):
    def __init__(self, broker: str, topic: str, group_id: str):
        # Configuration for the producer and consumer
        self.topic = topic
        self.producer: Producer = Producer({"bootstrap.servers": broker})
        self.consumer: Consumer = Consumer(
            {
                "bootstrap.servers": broker,
                "group.id": group_id,
                "auto.offset.reset": "earliest",
                "enable.partition.eof": "true",
            }
        )
        self.consumer.subscribe([self.topic])

    def publish(self, message: str) -> None:
        """Publish a message to the Kafka topic."""
        logger.debug(f"Publishing message to topic {self.topic}: {message}")

        self.producer.produce(self.topic, message.encode("utf-8"))
        self.producer.flush()

    def consume(self) -> str | None:
        """Consume a single message from the Kafka topic."""
        logger.debug(f"Consuming message from topic {self.topic}")

        # Get the next message
        message = self.consumer.poll(timeout=20)
        if message is None:
            logger.debug("Consumer poll timeout")
            return None
        # No new message
        if message.error() is not None and message.error().code() == KafkaError._PARTITION_EOF:
            logger.debug("No new messages in topic")
            return None
        # Check for errors
        if message.error() is not None:
            raise Exception(f"Consumer error: {message.error()}")
        self.consumer.commit(message, asynchronous=False)
        return message.value().decode("utf-8")

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}(topic={self.topic})"

Build docker images on remote Linux VM

TL;DR: Create a Linux VM in the cloud, then create a docker context for it with

docker context create linux-builder --docker "host=ssh://username@remote-ip"

then build your image with

docker --context linux-builder buildx build --platform linux/amd64 -t my-image .

Problem: Building some Docker images on a modern Mac fails

At work, I'm using an M3 MacBook. It's a great machine, but it's not perfect. One issue is that I can't always build Docker images targeting linux/amd64 on it.

Recently, I had an issue where I needed to package a Python application in Docker, and one of the dependencies was pytorch. I suspect that is where my issue was coming from.

Building the image on the Mac works fine as long as I also run it there, but when I try to run it on a Linux machine, it fails with the following error:

exec /app/.venv/bin/python: exec format error

This indicated that the Python binary was built for the wrong architecture. Luckily, you can specify the target architecture using the --platform flag when building the image.

docker buildx build --platform linux/amd64 -t my-image .

Unfortunately, this didn't work for me. I suspect that the pytorch dependency was causing the issue. I got the following error:

Cannot install nvidia-cublas-cu12.

Solution: Build the image on a remote Linux VM

To solve this issue, I decided to build the image on a remote x86_64 Linux VM. This way, I can ensure that the image is built for the correct architecture.

I used an Azure Virtual Machine with an Ubuntu 24.04 image. I enabled "Auto-shutdown" at midnight every day to save costs.

After ssh-ing into the VM, I installed Docker and added the user to the docker group:

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker azureuser

Log out and back in (or run newgrp docker) so the group change takes effect, then check that the Docker daemon is running:

sudo systemctl status docker

Now, back on my local machine, I created a docker context for the remote VM:

docker context create linux-builder --docker "host=ssh://azureuser@remote-ip"

Now, I can build the image using the context:

docker --context linux-builder buildx build --platform linux/amd64 -t my-image .

I can also make this context the default for all future commands:

docker context use linux-builder
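
To verify which daemon you are talking to, or to switch back to the local one later, the usual context commands apply:

docker context ls
docker --context linux-builder info
docker context use default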