Enterprise DVR and Search
Backup and Disaster Recovery Software
Performance and Security Monitoring Software
Bachelor of Science in Computer Science
The resource management service is the central hub for activity in the cloud environment. It receives requests from customers' appliances and cloud nodes and answers questions such as "Where should I send this data?" and "Where can I find the data I sent you?". It is also responsible for triggering actions against the environment and customers' appliances, ranging from starting a disaster recovery (DR) in the cloud to triggering an update of the customer's software.
The service is written in Go and all interactions with it are made through RESTful APIs. It is backed by a three-node Cassandra cluster and handles thousands of requests per second. All requests are made over HTTPS with both client and server authentication enabled, and the service also acts as the certificate authority for all nodes in the ecosystem.
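The transport pattern described above can be sketched with Go's standard library: an HTTPS handler that requires clients to present certificates signed by an internal CA. The route, file names and payload below are illustrative only, not the actual API.

```go
// Minimal sketch of an HTTPS endpoint with mutual (client + server)
// authentication, in the spirit of the resource management service.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"encoding/json"
	"log"
	"net/http"
	"os"
)

func main() {
	// Trust only clients whose certificates were signed by our internal CA.
	caPEM, err := os.ReadFile("ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caPEM)

	mux := http.NewServeMux()
	mux.HandleFunc("/v1/locate", func(w http.ResponseWriter, r *http.Request) {
		// Answer "where can I find the data I sent you?" style questions.
		json.NewEncoder(w).Encode(map[string]string{"node": "node-042.example.internal"})
	})

	srv := &http.Server{
		Addr:    ":8443",
		Handler: mux,
		TLSConfig: &tls.Config{
			ClientAuth: tls.RequireAndVerifyClientCert, // mutual TLS
			ClientCAs:  caPool,
		},
	}
	log.Fatal(srv.ListenAndServeTLS("server.pem", "server.key"))
}
```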
While implementing an HTTPS RESTful web service is not groundbreaking, the difficult part lies in being a central service that must coordinate with applications and services written by different teams. This highlighted the importance not just of good documentation but also of making sure that releases are managed properly. It is also easy to fall into the trap of only implementing the APIs needed to achieve the intended functionality while overlooking the APIs and tools needed for day-to-day monitoring and maintenance of the environment.
The node control service is a foundational service installed on each node in the ecosystem that provides the tools we need to monitor and manage the node. Each machine runs a uniquely identified copy of the service, which allows us to monitor health and perform actions against the node.
The node control service is a small service written in Go that heartbeats periodically, performs requested actions and monitors the health of the machine. It is lightweight in that it does not use a database or store extensive data; instead, it relies on the information it gets from the resource management service via HTTPS interactions.
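A minimal sketch of that heartbeat loop, assuming a hypothetical /v1/heartbeat endpoint and payload; the real service gathers considerably more health data and talks to the resource management service over a mutually authenticated HTTPS client.

```go
// Sketch of a lightweight heartbeat loop that periodically reports node
// health. Endpoint, node ID and payload fields are illustrative.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type heartbeat struct {
	NodeID    string  `json:"node_id"`
	DiskFree  uint64  `json:"disk_free_bytes"`
	LoadAvg   float64 `json:"load_avg"`
	Timestamp int64   `json:"timestamp"`
}

func main() {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		hb := heartbeat{
			NodeID:    "node-042",
			DiskFree:  gatherDiskFree(),
			LoadAvg:   gatherLoadAvg(),
			Timestamp: time.Now().Unix(),
		}
		body, _ := json.Marshal(hb)
		resp, err := http.Post("https://resource-mgmt.example.internal/v1/heartbeat",
			"application/json", bytes.NewReader(body))
		if err != nil {
			log.Printf("heartbeat failed: %v", err) // retry on the next tick
			continue
		}
		resp.Body.Close()
	}
}

// Stubbed health checks; the real service inspects the machine directly.
func gatherDiskFree() uint64 { return 0 }
func gatherLoadAvg() float64 { return 0 }
```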
The control service is a pattern seen in many products, usually in the form of an agent service. The difficulty is that the scale of the environment includes thousands of machines making requests from all over the world. This scale and diversity has taught many lessons in ensuring consistent connectivity, identifying and resolving network issues, and keeping the systems responsive under constantly increasing load.
My responsibilities included not only building out the cloud infrastructure pieces but also working with the established product to make it an important component in the cloud. Becoming proficient with this mature code base was important in making sure that the solutions I produced were well designed and that I could maintain my velocity without having to rely heavily on team members.
Learning an existing code base that is over ten years old is always a challenge, but expanding it for modern use cases with larger workloads exposed various latent bottlenecks and defects. This was an important learning exercise that helped me understand the backup domain while reinforcing the importance of being thorough in the development and testing of changes.
Developing a network protocol to work across the WAN introduces complications not normally seen on a LAN. These problems are magnified in a backup solution, where large amounts of data must be transmitted over long periods of time. One of the key initiatives I drove was the implementation of a protocol that met our customers' needs: it had to be resumable, secure and performant while still protecting the integrity of the data.
The protocol was implemented in Go and is completely custom, without the use of existing protocol technologies. We went fully custom because, given the amount of data that needs to be transferred, we had to maintain full control of protocol overhead and behavior. The security of the protocol is provided by SSL/TLS encryption and certificate authentication.
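A rough illustration of the resumable behavior, under the assumption (not stated above) that the receiver reports how many bytes it has already persisted so the sender can continue from that offset; framing, checksumming and retry logic are omitted.

```go
// Illustrative sketch of resuming a transfer over a TLS connection.
package transfer

import (
	"crypto/tls"
	"encoding/binary"
	"io"
	"log"
	"os"
)

func send(addr, path string, cfg *tls.Config) error {
	conn, err := tls.Dial("tcp", addr, cfg) // certificate auth happens here
	if err != nil {
		return err
	}
	defer conn.Close()

	// The receiver replies with the offset it has safely persisted so far,
	// which lets an interrupted transfer pick up where it left off.
	var offset int64
	if err := binary.Read(conn, binary.BigEndian, &offset); err != nil {
		return err
	}

	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Seek(offset, io.SeekStart); err != nil {
		return err
	}

	n, err := io.Copy(conn, f) // stream the remainder of the file
	log.Printf("sent %d bytes starting at offset %d", n, offset)
	return err
}
```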
The WAN introduces many unknowns to network communication. We encounter unexpected hurdles ranging from DNS failures to invasive firewalls tampering with certificates. Creating a protocol that responds properly to these situations is a challenge that extends beyond the initial implementation and teaches many lessons in the importance of negative testing and comprehensive logging. The delicate nature of the data we transmit means that extra effort was required to ensure that it was received and persisted safely. It is therefore important not only to make sure that the protocol is well-tested but also to have appropriate safety nets in place for when something inevitably fails.
The established product used many common technologies found in Java code bases from the time period. One such technology was the use of Hibernate to manage the database stored with the customer's backup data. This introduced various complications:
Because of the problems we were facing, I made the decision early on to invest in replacing Hibernate to simplify the product and give us the level of control we needed to grow it over the years that followed.
I spent most of my efforts refactoring the existing Java code to remove its reliance on Hibernate while retaining the expected characteristics (data structures, transactions, integrity, etc.). SQL tables were replaced with JSON documents that could be easily read and consumed. Embedded database libraries were replaced with a database service that provided basic transactions and brokered requests.
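The refactoring itself was done in Java; purely as an illustration (written in Go to stay consistent with the other sketches here), the shape of the change looks roughly like this: an ORM-managed entity becomes a readable JSON document written atomically as a simple integrity safeguard. All names are hypothetical.

```go
// Sketch only: a record that an ORM would have mapped to SQL tables is
// persisted instead as a human-readable JSON document, using a
// write-to-temp-then-rename pattern so readers never see a half-written file.
package docstore

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// BackupRecord is a hypothetical stand-in for an ORM-managed entity.
type BackupRecord struct {
	ID        string   `json:"id"`
	FileSets  []string `json:"file_sets"`
	SizeBytes int64    `json:"size_bytes"`
}

func Save(dir string, rec BackupRecord) error {
	data, err := json.MarshalIndent(rec, "", "  ")
	if err != nil {
		return err
	}
	tmp := filepath.Join(dir, rec.ID+".json.tmp")
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	// Rename is atomic on POSIX filesystems.
	return os.Rename(tmp, filepath.Join(dir, rec.ID+".json"))
}
```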
Hibernate is a powerful tool, but it takes away control of your data. This can give a young product the velocity it needs to reach market, but it may incur penalties later on by making support and maintenance more expensive. However, these are normally solvable problems that do not require replacing Hibernate in the product. The complication we encountered was that, given the number of database instances we had to load on a single cloud, we could no longer afford the overhead that Hibernate introduces. In our situation it made more sense to take the risk of replacing Hibernate in the hope that it would pay off further down the road. As a result, our time to diagnose problems has dropped, we have more flexibility in how a problem can be solved, and we have a much stronger understanding of the workload in the cloud along with more direct control over it.
Disaster recovery is an important feature in any modern backup solution. The popularity of virtual machines has introduced the concept of restoring the backed up data to a VM to help a business resume activity while the original hardware is repaired or replaced. One of my first projects was introducing the initial framework for restoring backup data to a VirtualBox VM and laying the foundations for the improvements that would follow in the years afterwards.
The first pass of the solution was a complex dance between new Go services, existing Java APIs and interactions with the VirtualBox hypervisor. Eventually this segued into further improvements to how we access the backup data and into direct manipulation of the VM's VMDK disk images.
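As a sketch of the hypervisor interaction, the first-pass approach can be approximated by shelling out to VBoxManage to register a VM, attach the restored VMDK and boot it headless; the VM settings and paths below are illustrative, not the product's actual configuration.

```go
// Rough sketch of driving VirtualBox from Go via the VBoxManage CLI.
package main

import (
	"log"
	"os/exec"
)

func run(args ...string) {
	out, err := exec.Command("VBoxManage", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("VBoxManage %v failed: %v\n%s", args, err, out)
	}
}

func main() {
	vm, vmdk := "dr-restore", "/restores/customer-42/disk0.vmdk"

	run("createvm", "--name", vm, "--register")
	run("modifyvm", vm, "--memory", "4096", "--nic1", "nat")
	run("storagectl", vm, "--name", "SATA", "--add", "sata")
	run("storageattach", vm, "--storagectl", "SATA",
		"--port", "0", "--device", "0", "--type", "hdd", "--medium", vmdk)
	run("startvm", vm, "--type", "headless")
}
```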
Working on the disaster recovery solution was a constant exercise in identifying and fixing bottlenecks. A lot of my time was spent peeling away layers and finding more direct routes for getting the data from one point to another. This resulted in initiatives such as rewriting our Java encryption code in C++ so we could more efficiently access and decrypt the backup data. Another initiative also involved implementing custom code for interacting with VMDK files so we could directly insert the backup data without having to rely on the hypervisor to perform the initial write for us.
My work on cloud load balancing represented a year's worth of planning and coordination, resulting in many incremental changes that coalesced into the final solution. The goal of this work was to reduce the cost of owning the cloud environment as it continued to grow and to keep that growth from negatively impacting our quality of life.
The first step in the load balancing solution was making sure we had mechanisms to safely manage customer data without risking corruption or data loss. This started with various APIs to pause or disable activity on the customer's data so that we would know it was safe to move around. Then I introduced a simple file transfer protocol that would move the data between two nodes in the environment and provide the necessary confirmations that the data was safely moved. Finally, once we had confidence that these mechanisms were functioning properly, we could remove the manual step of cleaning up the original copy and fully automate the process of moving data by wrapping everything into a single, easy-to-use API. By reducing this to a single point of entry we can eliminate opportunities for user error while continuing to improve our ability to audit the work being done.
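A hypothetical sketch of that single entry point, assuming illustrative client interfaces rather than the actual product APIs: pause activity, transfer and verify, and only then clean up the source.

```go
// Sketch of wrapping the individual data-movement steps behind one audited
// operation so operators never have to sequence them by hand.
package balance

import "fmt"

type Client interface {
	PauseActivity(customerID string) error
	Transfer(customerID, srcNode, dstNode string) (checksumOK bool, err error)
	RemoveOriginal(customerID, srcNode string) error
	ResumeActivity(customerID string) error
}

func MoveData(c Client, customerID, srcNode, dstNode string) error {
	if err := c.PauseActivity(customerID); err != nil {
		return fmt.Errorf("pause: %w", err)
	}
	defer c.ResumeActivity(customerID)

	ok, err := c.Transfer(customerID, srcNode, dstNode)
	if err != nil {
		return fmt.Errorf("transfer: %w", err)
	}
	if !ok {
		return fmt.Errorf("transfer of %s did not verify; keeping original", customerID)
	}
	// Only remove the source copy once the destination is confirmed safe.
	return c.RemoveOriginal(customerID, srcNode)
}
```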
With the immediate concern of managing the data solved, I was able to focus my efforts on improving the data we have about our environment. This involved gathering metrics about the customer's data and the environment itself, such as how often the data is updated, the size of each update, how much it costs to process the update, the age of the data and the complexity of the data. These metrics are stored with the customer data in a way that preserves the historical record, so we can understand how the data changed over its lifetime. This is an important part of understanding the current state of the system as well as predicting how new data might behave over its lifespan.
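The metric set stored with each customer's data might be modeled roughly like this; the field names are assumptions for illustration, not the actual schema.

```go
// Illustrative shape of per-customer metrics with history preserved.
package metrics

import "time"

type Sample struct {
	Taken          time.Time     `json:"taken"`
	UpdateSize     int64         `json:"update_size_bytes"`
	ProcessingCost time.Duration `json:"processing_cost"`
	Complexity     float64       `json:"complexity"`
}

type History struct {
	CustomerID string    `json:"customer_id"`
	DataAge    time.Time `json:"data_created"`
	Samples    []Sample  `json:"samples"` // appended over time, never overwritten
}
```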
As the first two phases were being implemented I started work on a web-based tool that would be the team's window into the environment. The purpose of this tool was to move the team away from accessing the environment's APIs directly, making interactions with the environment less difficult (and less error prone) while also strengthening the security and auditing protecting system-critical APIs. Using the control panel the team could:
The final step in the load balancing solution was to automate the various actions taken by the environment maintainers so their time is no longer spent on repetitive maintenance tasks. Using the control panel as the foundation, this began by breaking the work down into separate load balancing engines, each responsible for performing a specific task. For example, one engine handles volumes that are running out of space and identifies data that should be moved off, while another ensures that IO levels on a node are not too high and moves data to reduce hot spots in the environment. By breaking the work into multiple engines I was able to prioritize it and reduce the complexity of the implementation. As the engines detected actions to take they would generate work items that team members could approve; once confidence in an engine was built up, these items eventually required no human interaction at all.
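The engine/work-item pattern can be sketched as follows, with all names hypothetical; the key idea is that every engine starts out producing items that wait for human approval, and auto-approval is enabled per engine once it has earned trust.

```go
// Sketch of per-concern load balancing engines emitting work items.
package engines

type WorkItem struct {
	Engine     string // which engine proposed it
	CustomerID string
	SrcNode    string
	DstNode    string
	Approved   bool
}

// Engine is implemented once per concern, e.g. a "volume nearly full"
// engine and an "IO hot-spot" engine.
type Engine interface {
	Name() string
	Scan() ([]WorkItem, error)
	AutoApprove() bool
}

func RunOnce(all []Engine, queue chan<- WorkItem) error {
	for _, e := range all {
		items, err := e.Scan()
		if err != nil {
			return err
		}
		for _, item := range items {
			item.Engine = e.Name()
			item.Approved = e.AutoApprove() // otherwise waits for a human
			queue <- item
		}
	}
	return nil
}
```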
The length and complexity of this project highlighted the importance of starting with a good design document and why an agile approach to development results in a more robust solution. By breaking the project into smaller pieces I was able to better prioritize work and produce deliverables that would be immediately useful to the team and customers without them having to wait for the final solution. This had the added benefit that items created for a previous release were given time to bake and mature while work was being done on the next item. I also learned early on that it is best to build a solution with the expectation that it will have issues rather than attempting to make it perfect on the first try. For example, instead of assuming that a load balancing engine will always produce valid work items, it is best to assume it won't and build an initial solution that requires human approval of work items. Once confidence is high that the work items are correct, we can flip a switch and have the system approve them on its own.
Seed drives are a mechanism for customers with large amounts of data but slow network connections: they mail us a drive containing their data and we upload it to the cloud environment on their behalf. The original implementation of this work was done by a coworker, but in order to handle seed drives in remote data centers (UK, Australia) we needed a simpler approach that was easier to maintain. This resulted in an initiative to rework the existing mechanisms by simplifying or replacing pieces of them in order to reduce the cost of processing seed drives.
The revised solution replaced the need for expensive desktop machines to process the data with small form factor devices that met clearly defined minimum requirements. This allowed us to buy or ship cheap hardware for the remote data centers and in some cases even use small ARM-based devices for processing the drives. Additionally, I worked with a coworker to replace the hardware console used to manage the states of the drives with a web UI that could be accessed remotely. This was done by defining the web APIs in advance, which allowed me to implement the APIs while he fleshed out the web UI that would use them. In the process the uploading of the data was also greatly simplified by leveraging new features introduced in my work on load balancing. This has helped immensely in reducing the amount of effort needed to set up and maintain these machines.
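As an illustration of the API-first split, a contract for the drive-management endpoints might look like the sketch below; the routes, states and in-memory storage are stand-ins, not the real implementation.

```go
// Sketch of a small drive-state API that a web UI could be built against
// in parallel with the backend. Not concurrency-safe; illustration only.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type Drive struct {
	Serial string `json:"serial"`
	State  string `json:"state"` // e.g. "received", "uploading", "complete"
}

var drives = map[string]*Drive{} // stand-in for real state tracking

func main() {
	mux := http.NewServeMux()

	// List every drive currently being processed at this site.
	mux.HandleFunc("/api/drives", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(drives)
	})

	// Advance a drive to its next state (start upload, mark complete, ...).
	mux.HandleFunc("/api/drives/advance", func(w http.ResponseWriter, r *http.Request) {
		var d Drive
		if err := json.NewDecoder(r.Body).Decode(&d); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		drives[d.Serial] = &d
		w.WriteHeader(http.StatusAccepted)
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```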
This was a valuable project that demonstrated the importance of being able to define an API and use it as a way to divide work and speed up the implementation. It was also a good exercise in building a solution that would survive when it is constrained by additional factors such as being hard to access or needing to run on cheap hardware. Also, since some of this work leveraged new features introduced by the load balancing work it became an important example of why making new features simple and versatile can benefit other teams and projects.
An important component in the health of a complex cloud environment is the management of new releases. The goal of every release is to ensure the changes are fully tested (in isolation and at scale) and that only the expected changes make it into production. My work here focused not just on improving our tools but also on making sure that our processes improved as the environment grew.
A key mechanism introduced was a web-based UI responsible for building our final release artifacts (apt repositories and installation media) and managing the pipeline they flow through to ensure that we only deploy what we intended to deploy. This was a custom web UI written with JavaScript, HTML and a CSS framework named Materialize, hosted on a service written in Go that provided the artifacts and the business logic for generating them. Each release was represented by a JSON document that gave us the control we needed over what goes into the release and how it gets consumed further down the chain.
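The write-up does not list the actual fields of the release document, so the structure below is a hypothetical illustration of the role it plays.

```go
// Hypothetical shape of a per-release JSON document.
package release

type Artifact struct {
	Name   string `json:"name"`
	Sha256 string `json:"sha256"`
}

type Release struct {
	Version    string     `json:"version"`
	Branch     string     `json:"branch"`
	AptRepo    string     `json:"apt_repo"`   // repository the packages publish to
	Artifacts  []Artifact `json:"artifacts"`  // what goes into the release
	Promotable bool       `json:"promotable"` // whether it may move down the pipeline
}
```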
It is difficult to build confidence that a release will not introduce unexpected changes to the environment. My work on this has helped stress the importance of good branch management in source control, proper communication with stakeholders and ensuring that the right tools are available for managing the pipeline. Improving our tools has also created new opportunities for changing how a release is delivered and consumed, which is an important step in reducing risk.
During the summer of 2016 I managed four undergraduates from the University of Houston computer science program as they worked on two projects to improve tools used by the team. I led them through the software development lifecycle by setting requirements, hosting daily standups and managing sprints. As the work progressed I held workshops to build up core skills in source control management, web development and security. In the end the resulting projects were presented to the team and the artifacts were checked in and deployed for general consumption.
Near the end of my time at NetIQ I began to branch into work exploring how to move existing products toward more widely accepted technologies. As the products we maintained aged, the principles they were built on became less accepted by potential customers. The biggest example of this was the use of thick clients instead of a web-based interface that customers could easily access from any machine with a browser. As I investigated ways to meet customer expectations I experimented with proof-of-concept replacement web interfaces and web-based plugins for third-party tools like Splunk.
I was hired into NetIQ as part of the team that works on the UNIX agent, so my primary responsibilities related to maintaining all aspects of the agent, including installation, configuration, debugging and content development. We supported most variants of UNIX (AIX, HP-UX, Linux, Solaris, OSF, IRIX) and provided support to the various products that used the agent. This also meant that to some extent we were expected to understand and support the core products that used our agent. These expectations required a wide range of skills in order to properly maintain the agent and its associated ecosystems.
In addition to working with the UNIX team on the agent technology, I was also responsible for maintaining the management interface customers use to install, configure and monitor agents. This was a Java application with a Swing UI that could connect to the agents to query health and apply configuration changes. Initially this was basic maintenance work and small improvements as I built up my understanding of the product, but eventually it turned into major rewrites as the core agent technologies shifted and customers brought new requirements around security and management features.
To give existing customers more capacity to transcode media in the cloud we needed to move the transcode work off of the EC2 instances and use the MediaConvert service in AWS to perform the work instead. This started with a research task to understand the capabilities and limits of MediaConvert and make sure that it still met the customers' requirements for speed and reliability. Once it was validated as a good replacement, I added a new code path to our transcode tasks to allow the work to be performed in AWS rather than on the local machine.
The MediaConvert solution used AWS Lambda and API Gateway to present a REST API that our product could exercise to negotiate where to store and retrieve media files and to request transcode tasks in AWS. These APIs managed the creation of S3 buckets, translated request parameters into MediaConvert jobs and reported the progress of the transcode.
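A sketch of that Lambda-behind-API-Gateway pattern using the aws-lambda-go runtime; the request shape is an assumption, and the actual job submission to MediaConvert is reduced to a comment.

```go
// Sketch of an API Gateway-fronted Lambda that accepts a transcode request.
package main

import (
	"context"
	"encoding/json"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

type transcodeRequest struct {
	SourceKey string `json:"source_key"`
	Preset    string `json:"preset"`
}

func handler(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
	var tr transcodeRequest
	if err := json.Unmarshal([]byte(req.Body), &tr); err != nil {
		return events.APIGatewayProxyResponse{StatusCode: 400, Body: "bad request"}, nil
	}

	// The real handlers would translate tr into a MediaConvert job via the
	// AWS SDK and return the job ID plus the negotiated S3 locations.
	resp, _ := json.Marshal(map[string]string{"status": "queued", "source": tr.SourceKey})
	return events.APIGatewayProxyResponse{StatusCode: 202, Body: string(resp)}, nil
}

func main() {
	lambda.Start(handler)
}
```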
A key feature of the new monitoring and compliance product was the ability to measure and report the loudness of a given media program to verify that it is in compliance. This began with research to understand how loudness values are computed and what values are considered compliant. Once we understood those requirements we began measuring audio parameters in the target media files and computing loudness values according to the relevant algorithms. The values were then stored with the media, and new user interface elements were created to allow the user to view the data and determine whether the media was in compliance.
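As a greatly simplified illustration of the measurement side (the write-up does not name the standard, but broadcast loudness compliance typically follows something like ITU-R BS.1770 / ATSC A/85, which adds K-weighting, channel weighting and gating before reporting an integrated loudness in LKFS), a bare mean-square level in dB can be computed like this:

```go
// Simplified loudness-style measurement; not a compliance-grade algorithm.
package loudness

import "math"

// MeanSquareDB returns 10*log10 of the mean square of the samples,
// where samples are normalized to the range [-1, 1].
func MeanSquareDB(samples []float64) float64 {
	if len(samples) == 0 {
		return math.Inf(-1)
	}
	var sum float64
	for _, s := range samples {
		sum += s * s
	}
	return 10 * math.Log10(sum/float64(len(samples)))
}
```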
Changes to the library UI created a need to generate thumbnails for each piece of media represented. This involved post-processing the media files to pull out an image that would make a suitable thumbnail and then storing it so it could be used by the UI.
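The tooling is not named above; as one way to do it, a single frame can be extracted and scaled with ffmpeg, as in this sketch.

```go
// Sketch of extracting a thumbnail frame by shelling out to ffmpeg.
package main

import (
	"log"
	"os/exec"
)

func main() {
	cmd := exec.Command("ffmpeg",
		"-ss", "00:00:05", // seek past any leading black frames
		"-i", "input.mov", // source media (illustrative path)
		"-frames:v", "1", // extract exactly one frame
		"-vf", "scale=320:-1", // scale to a thumbnail-friendly width
		"thumbnail.jpg")
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("thumbnail generation failed: %v\n%s", err, out)
	}
}
```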
In this context a 'proxy' is an alternate version of the media, usually at a lower resolution or bitrate, to facilitate watching it across a network with bandwidth limitations. The problem with generating a proxy is that the entire media item must be reprocessed and transcoded, which is an expensive and sometimes long-running task. My work on this feature involved adding the backend support for launching these processes on node machines and making sure that the workload was evenly balanced across the cluster.
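A minimal sketch of the balancing decision, with a hypothetical Node type: dispatch each proxy job to the node currently running the fewest transcodes.

```go
// Sketch of choosing the least-loaded node for an expensive proxy transcode.
package proxy

import "errors"

type Node struct {
	Name       string
	ActiveJobs int
}

// PickNode returns the node with the fewest active jobs so that proxy
// generation stays evenly spread across the cluster.
func PickNode(nodes []Node) (Node, error) {
	if len(nodes) == 0 {
		return Node{}, errors.New("no nodes available")
	}
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.ActiveJobs < best.ActiveJobs {
			best = n
		}
	}
	return best, nil
}
```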