Original link: www.cmyr.net/blog/gui-fr…

From several recent discussions of GUI programming in Rust, I was left with the impression that the word “GUI” means very different things to different people.

I want to try to clarify this by first describing some of the different things people mean by “GUI framework/toolkit”, and then going into detail about the essential components of a classic desktop GUI framework.

Although this article isn’t specifically about Rust, it does stem from Rust: it comes largely from my experience working on Druid, a Rust GUI toolkit for the desktop.

Once we have a common understanding of this issue, we can better discuss the state of this work in Rust, which will be the subject of a follow-up post.

What are we talking about when we talk about GUI

GUI frameworks can be many different things, with different use cases and different deployment targets. A framework for building embedded applications won’t run easily on the desktop; a framework for building desktop applications won’t run easily on the web.

Regardless of the specifics, there is one major dividing line to recognize: whether or not the framework is expected to integrate tightly with an existing platform or environment.

So on one side of the line are tools for building games, embedded applications, and, to a lesser degree, web applications. In this world, you are responsible for providing almost everything your application needs, and you interact closely with the underlying hardware: accepting raw input events and outputting your UI to some kind of buffer or surface. (The web is different: there, the browser vendor has done the integration for you.)

On the other side of the line are tools for building traditional desktop applications. In this world, you must integrate tightly with a large number of existing platform APIs, design patterns, and conventions, and that integration is a major source of your design’s complexity.

Before we dive into all the integrations that desktop application frameworks expect, let’s talk briefly about the first case.

Games and GUIs for embedded applications (think of the infotainment system in the back of a taxi, or the interface on a medical device) differ from desktop GUIs in many ways, most of which can be understood in terms of platform integration: games and embedded applications don’t have to do that much of it. Typically, a game or embedded application is a world unto itself: there is a single “window”, and the application is responsible for drawing everything in it. The application doesn’t need to worry about menus or subwindows; it doesn’t have to worry about the compositor, or about integrating with the platform’s IME system. Although they probably should, these applications generally don’t support complex scripts. They can ignore rich text editing. They may not need to support font enumeration or fallback. They often ignore accessibility.

Of course, they have additional challenges of their own. Embedded applications must think more carefully about resource constraints and may need to avoid allocation altogether. When they do need features like complex scripts or text input, they have to implement them themselves and can’t rely on anything the system provides.

Games are similar, except that they also have their own unique performance concerns and considerations, which I’m not qualified to discuss in any real detail.

Games and embedded are certainly interesting areas. Embedded in particular is somewhere I think Rust GUI really makes sense, since Rust has a strong value proposition for embedded use generally, for many of the same reasons.

However, a project intended for gaming or embedded development is unlikely to address the entire list of features we expect in desktop applications.

Anatomy of a “native desktop application”

The main distinguishing feature of desktop applications is their tight integration with the platform. Unlike games or embedded applications, desktop applications need to interact closely with the host operating system as well as other software.

I want to run through some of the main points of integration that are required, and some of the possible approaches to providing them.

Windowing

The application must be able to instantiate and manage windows. The API should allow you to customize a window’s appearance and behavior, including whether it can be resized, whether it has a title bar, and so on. The API should allow multiple windows, and it should also support modal and child windows in a way that respects platform conventions. This means supporting app-modal windows (such as an alert that steals focus from the entire application until it is dismissed) and window-modal windows (an alert that steals focus from a given window until it is dismissed). Modal windows are used to implement a number of common features, including open/save dialogs (which may be platform-specific), alerts, confirmation dialogs, and standard UI elements such as combo boxes and other drop-down menus (think of a text field’s completion list).

The API must allow precise positioning of a child window relative to its parent. In the case of a combo box, for example, you might want to draw the currently selected item, when the list of options is shown, at the same baseline position it occupied while the list was closed.
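To make this concrete, here is one possible shape for such a windowing API, sketched in Rust. Every name here (`WindowId`, `Modality`, `WindowBuilder`) is invented for illustration, loosely in the spirit of the builders found in crates like winit or druid-shell, not any real library’s API:

```rust
/// Identifies a window. Hypothetical; for illustration only.
#[derive(Clone, Copy, PartialEq)]
pub struct WindowId(u64);

/// How a new window participates in modality.
pub enum Modality {
    /// An ordinary, independent window.
    None,
    /// Steals focus from one parent window until dismissed
    /// (confirmation sheets, combo-box popups).
    WindowModal { parent: WindowId },
    /// Steals focus from the whole application until dismissed (alerts).
    AppModal,
}

pub struct WindowBuilder {
    title: String,
    resizable: bool,
    show_titlebar: bool,
    modality: Modality,
    /// Origin in the parent's coordinate space, so that e.g. a combo-box
    /// popup can align its selected item with the collapsed control.
    position_in_parent: Option<(f64, f64)>,
}

impl WindowBuilder {
    pub fn new(title: &str) -> Self {
        WindowBuilder {
            title: title.to_owned(),
            resizable: true,
            show_titlebar: true,
            modality: Modality::None,
            position_in_parent: None,
        }
    }

    pub fn resizable(mut self, resizable: bool) -> Self {
        self.resizable = resizable;
        self
    }

    pub fn modality(mut self, modality: Modality) -> Self {
        self.modality = modality;
        self
    }

    pub fn position_in_parent(mut self, x: f64, y: f64) -> Self {
        self.position_in_parent = Some((x, y));
        self
    }
}
```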

Tabs

You also need to support tabs. You should be able to drag tabs out of a tab group to create a new window, as well as drag tabs between windows. Ideally, you would use the platform’s native tab infrastructure, but… it’s complicated. Browsers have all come up with their own implementations, presumably for good reason. You would also want to respect the user’s tab preferences (macOS lets the user choose, system-wide, whether new windows should open as tabs), but that’s an additional complication. I’ll forgive you if you skip it, but if your framework sees much use, someone will report it as a bug every month until you die, and they won’t be wrong.

Menus

Closely related to windows are menus; desktop applications are expected to respect platform conventions around window and application menus. On Windows, a menu is a component of a window. On macOS, the menu is a property of the application, updated to reflect the commands available for the active window. On Linux, things are less clear-cut: if you use GTK, there are window menus and application menus, although the latter are deprecated. If you target X11 or Wayland directly, you’ll need to implement menus yourself, and in principle you can do whatever you want, though the easy path is a Windows-style per-window menu.

Generally, there are clear conventions about which menus should be provided and which commands should appear in them; a well-behaved desktop application should adhere to those conventions.

Drawing

To draw your application’s content, you need (at a minimum) a basic 2D graphics API. This should provide the ability to fill and stroke paths (with colors, including transparency, and with linear and radial gradients), lay out text, draw images, define clip regions, and apply transformations. Ideally, your API also provides some fancier features, such as blend modes and blurs (for drop shadows, among other things).

These APIs exist in slightly different forms on the various platforms: on macOS you have CoreGraphics, on Windows Direct2D, and on Linux Cairo. One approach, then, is to present a common API abstraction over these platform APIs, papering over the rough edges and filling in the gaps. (This is our current approach, in the form of the Piet library.)
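As a sketch, the kind of abstraction in question might look like the following Rust trait, which each platform backend (CoreGraphics, Direct2D, Cairo) would implement. The names and signatures here are invented for illustration, loosely in the spirit of Piet’s render context, not Piet’s actual API:

```rust
pub struct Point { pub x: f64, pub y: f64 }

/// A 2x3 affine transformation matrix.
pub struct Affine(pub [f64; 6]);

pub struct Color { pub r: f64, pub g: f64, pub b: f64, pub a: f64 }

/// Paint sources: solid colors (with transparency) and gradients.
pub enum Brush {
    Solid(Color),
    LinearGradient { start: Point, end: Point, stops: Vec<(f64, Color)> },
    RadialGradient { center: Point, radius: f64, stops: Vec<(f64, Color)> },
}

/// The minimal drawing surface a GUI framework needs; each platform
/// backend implements this trait in terms of its native 2D API.
pub trait RenderContext {
    type Path;
    type Image;
    type TextLayout;

    fn fill(&mut self, path: &Self::Path, brush: &Brush);
    fn stroke(&mut self, path: &Self::Path, brush: &Brush, width: f64);
    fn clip(&mut self, path: &Self::Path);
    fn transform(&mut self, affine: Affine);
    fn draw_image(&mut self, image: &Self::Image, origin: Point);
    fn draw_text(&mut self, layout: &Self::TextLayout, origin: Point);
}
```

Note how the associated types hint at the difficulty: paths, images, and especially text layouts are platform resources, and their behaviors don’t line up exactly.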

This does have its drawbacks. These APIs are different enough (especially in tricky areas such as text) that designing a good abstraction is challenging and requires jumping through some hoops. Subtle differences in platform behavior can also produce inconsistent rendering.

It’s simpler to use the same renderer everywhere. One candidate would be something like Skia, the rendering engine used in Chrome and Firefox. This has the advantage of portability and consistency, at the cost of binary size and compile time: a release build of a Rust binary using the skia-safe crate has a baseline size of around 17MB (my methodology wasn’t very rigorous, but I think it’s a reasonable ballpark).

Skia is still a fairly traditional software renderer, although it now has significant GPU support. Ultimately, though, the most exciting prospects are the ones that move more of the rendering task onto the GPU.

One of the initial challenges here is the diversity of APIs for GPU programming, even for the same hardware: the same physical GPU may be addressed via Metal on Apple platforms, DirectX on Windows, and Vulkan on much of the rest. Making code portable across these requires duplicated implementations, some form of cross-compilation, or an abstraction layer. The problem with the last option is that it is hard to write an abstraction that offers adequate control of advanced GPU features, such as compute, across subtly different low-level APIs.
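To illustrate the abstraction-layer problem, here is a deliberately minimal sketch; all names are invented (in the real world, this niche is occupied by projects like wgpu). Even in this toy version, awkward design questions surface immediately, such as what language the shader source should be in:

```rust
/// A hypothetical portable GPU interface. Each platform backend would
/// translate these calls into Metal, DirectX 12, or Vulkan.
pub trait GpuDevice {
    type Buffer;
    type ComputePipeline;

    fn create_buffer(&self, contents: &[u8]) -> Self::Buffer;

    /// What goes in `shader` is already a design problem: Metal wants
    /// MSL, Vulkan wants SPIR-V, DirectX wants HLSL/DXIL. A portable
    /// API must pick one representation and cross-compile, or invent
    /// its own shading language.
    fn create_compute_pipeline(&self, shader: &str) -> Self::ComputePipeline;

    /// Dispatch a compute workload. Exposing compute at all rules out
    /// backends (and older hardware) that can't express it.
    fn dispatch_compute(
        &self,
        pipeline: &Self::ComputePipeline,
        workgroups: (u32, u32, u32),
    );
}
```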

Once you’ve figured out how you want to talk to the hardware, you then need to figure out how to rasterize 2D scenes efficiently and correctly on the GPU. This may also be more complicated than you’d first suspect. Since GPUs are good at drawing 3D scenes, and 3D scenes seem “more complex” than 2D scenes, it seems natural to conclude that GPUs should handle 2D easily. They don’t. The rasterization techniques used in 3D are poorly suited to 2D tasks such as clipping to vector paths or anti-aliasing, and the techniques that produce the best results have the worst performance. Worse, these traditional techniques scale badly in 2D once large numbers of blend groups or clip regions are involved, since each region requires its own temporary buffer and draw call.

There is some promising newer work (such as piet-gpu) that uses compute shaders and can render 2D scenes with a rich imaging model at smooth, consistent performance. This is an active area of research. One potential limitation is that compute shaders are a comparatively new feature, only available on GPUs made in the last five years or so. Other renderers, including WebRender (used in Firefox), use more traditional techniques and have broader compatibility.

Animation

Oh, and: whichever approach you choose, you also need to provide an ergonomic, high-performance animation API. This is worth thinking about sooner rather than later; retrofitting it afterwards is a pain.
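As a tiny illustration of what the core of such an API might look like (names invented, purely a sketch): the framework drives every active animation once per frame, so animated values stay in lockstep with painting.

```rust
/// A single animated value; a real API would generalize the endpoints
/// (colors, points, layouts) and make the easing curve pluggable.
pub struct Animation {
    start: f64,
    end: f64,
    duration: f64, // seconds
    elapsed: f64,  // seconds
}

impl Animation {
    pub fn new(start: f64, end: f64, duration: f64) -> Self {
        Animation { start, end, duration, elapsed: 0.0 }
    }

    /// Advance by the frame delta and return the current value.
    /// Called by the framework once per frame, before painting.
    pub fn tick(&mut self, dt: f64) -> f64 {
        self.elapsed = (self.elapsed + dt).min(self.duration);
        let t = self.elapsed / self.duration;
        // Ease-in-out cubic.
        let eased = if t < 0.5 {
            4.0 * t * t * t
        } else {
            1.0 - (-2.0 * t + 2.0).powi(3) / 2.0
        };
        self.start + (self.end - self.start) * eased
    }

    pub fn is_finished(&self) -> bool {
        self.elapsed >= self.duration
    }
}
```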

Text

However you draw, you need to render text. At a minimum, a GUI framework should support rich text, complex scripts, and text layout (including things like line breaking, alignment, and justification, and ideally things like flowing text along arbitrary paths). You need to support emoji. You also need to support text editing, including right-to-left and BiDi support. This is an enormous undertaking. In practice, you have two options: bundle HarfBuzz, or use the platform text APIs: CoreText on macOS, DirectWrite on Windows, and probably Pango + HarfBuzz on Linux. There are other alternatives, including some promising Rust projects (such as Allsorts, rustybuzz, and Swash), but none of them is yet complete enough to fully replace HarfBuzz or the platform text APIs.
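For a feel of the API surface involved, here is a skeletal sketch of a text layout interface in Rust. The names are invented (loosely in the spirit of CoreText, DirectWrite, and Piet’s text traits), and this omits most of the hard parts, notably shaping, BiDi resolution, and font fallback:

```rust
use std::ops::Range;

pub struct TextStyle {
    pub font_family: String,
    pub size: f64,
    pub weight: u16,
}

/// Builds a laid-out paragraph. Rich text means styles are applied
/// to ranges, not to the whole string.
pub trait TextLayoutBuilder {
    type Layout: TextLayout;

    fn add_text(&mut self, text: &str);
    /// Apply a style to a byte range of the text added so far.
    fn add_style(&mut self, range: Range<usize>, style: TextStyle);
    /// Maximum line width for line breaking; `None` means a single line.
    fn max_width(&mut self, width: Option<f64>);
    fn build(self) -> Self::Layout;
}

/// The queries an editor needs are as important as drawing itself.
pub trait TextLayout {
    /// Which text position corresponds to a click at (x, y)?
    fn hit_test_point(&self, x: f64, y: f64) -> usize;
    /// Where on screen does the caret for a text position land?
    /// (Nontrivial with BiDi: one offset can have two visual carets.)
    fn caret_position(&self, offset: usize) -> (f64, f64);
    fn line_count(&self) -> usize;
}
```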

The compositor

2D graphics covers most of the drawing a desktop application will do, but not all of it. There are two other common cases worth mentioning: video and 3D graphics. In both cases, we want to take advantage of the available hardware: for video, the hardware H.264 decoder; for 3D, the GPU. This boils down to instructing the operating system to embed a video or 3D view within some region of our window, which means interacting with the compositor. The compositor is the component of the operating system that takes display data from various sources (the windows of different programs, video playback, GPU output) and assembles it into a coherent picture of the desktop.

Perhaps the best way to see why this matters to us is to think about how it interacts with scrolling. If you have a scrollable view that contains a video, you want the video to move in sync with the rest of the view’s contents as you scroll. This is harder than it sounds: you can’t just designate a region of the window once and embed the video there; you need some way to tell the operating system to move the video in sync with the scrolling.
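A sketch of what this interaction might look like from the framework’s side; everything here is hypothetical, for illustration only:

```rust
/// Handle to a composited layer owned by the operating system.
pub struct LayerHandle(u64);

/// Hypothetical interface to the platform compositor.
pub trait Compositor {
    /// Ask the OS to embed a hardware-decoded video surface in a
    /// region of our window.
    fn add_video_layer(&mut self, source: &std::path::Path) -> LayerHandle;

    /// Reposition a layer. During scrolling this must happen in sync
    /// with repainting, or the video will visibly lag the content
    /// scrolling around it.
    fn set_layer_frame(
        &mut self,
        layer: &LayerHandle,
        x: f64, y: f64, width: f64, height: f64,
    );

    /// Clip the layer to the visible portion of the scroll view.
    fn set_layer_clip(
        &mut self,
        layer: &LayerHandle,
        x: f64, y: f64, width: f64, height: f64,
    );
}
```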

Web views

And let’s not forget: sooner or later, someone is going to want to display some HTML (or an actual website!) inside their application. We’d really rather not bundle an entire browser engine for this, but using the platform webview also involves the compositor, and generally complicates our lives significantly. Maybe your users don’t need web views at all? In any case, it’s something to consider.

Input handling

Once you can manage windows and draw your content, you need to handle user input. We can roughly divide input into pointer, keyboard, and other, where “other” covers things like joysticks, gamepads, and other HID devices. We’ll ignore that last category, except to say that it’s nice to have but doesn’t need to be a priority. Finally, there are input events originating from system accessibility features; we’ll cover those when we talk about accessibility.

For both pointer and keyboard events, there is a relatively simple approach, and then there is the principled, correct, but significantly more involved approach.

Pointer input

For pointer events, the simple approach is to present an API that delivers mouse events, and to deliver trackpad events in a way that makes them look like mouse events: ignoring multi-touch, pressure, and any other touch-specific characteristics that don’t map onto mouse-like behavior. The hard approach is to implement something equivalent to the web’s PointerEvent API, which can fully represent multi-touch information (from both trackpads and touch displays) as well as stylus input.
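A sketch of what the “hard approach” representation might look like, modeled loosely on the web’s PointerEvent; the type and field names here are invented:

```rust
pub enum PointerKind {
    Mouse,
    Touch,
    Pen,
}

pub struct PointerEvent {
    pub kind: PointerKind,
    /// Stable id distinguishing simultaneous contacts; this is what
    /// makes multi-touch representable at all.
    pub pointer_id: u64,
    pub x: f64,
    pub y: f64,
    /// Normalized 0.0..=1.0; a mouse reports a constant value while
    /// a button is held down.
    pub pressure: f64,
    /// Stylus tilt in degrees, if the hardware reports it.
    pub tilt: Option<(f64, f64)>,
    /// True for the pointer that would play the mouse's role in the
    /// simple model.
    pub is_primary: bool,
}
```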

The simple approach to pointer events is… probably fine, assuming you can also deliver events for common trackpad gestures such as pinching and two-finger scrolling; otherwise your framework will immediately frustrate a lot of users. The number of applications that need or want sophisticated gesture recognition, or that expect to handle stylus input, is fairly small, but they exist, and a desktop application framework that doesn’t support these cases is fundamentally limited.

Keyboard input

Keyboard input is worse, on both counts: here the hard approach is especially difficult, and the “easy approach” is fundamentally limited; taking the easy route means your framework is essentially useless to a large portion of the world’s population.

For keyboard input, the simple approach is very simple: a keyboard key is generally associated with a character or string, and when the user presses the key, you take that string and insert it at the cursor position in the active text field. This works reasonably well for monolingual English text, and less well, though at least passably, for other Latin-1 languages and for scripts that behave like Latin, such as Greek, Cyrillic, or Turkish. Unfortunately (but not coincidentally), a large percentage of programmers mostly type ASCII, but most of the world does not. Serving those users requires integrating with the platform’s text input and IME system, an unfortunate problem that is both essential and extremely fiddly.
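Made concrete, the simple approach is just a few lines, which is exactly why it’s tempting; the comment marks where it falls apart. (This sketch is illustrative, not any real framework’s API.)

```rust
pub struct TextField {
    text: String,
    cursor: usize, // byte offset into `text`
}

impl TextField {
    /// The naive model: each key press arrives with a ready-made string.
    /// This works for "a" or "é", but it cannot express CJK input, where
    /// a whole *sequence* of key events must be interpreted, displayed
    /// provisionally, and revised by the IME before any text is committed.
    pub fn key_down(&mut self, characters: &str) {
        self.text.insert_str(self.cursor, characters);
        self.cursor += characters.len();
    }
}
```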

IME stands for Input Method Editor, and is the umbrella term for the platform-specific mechanisms that convert keyboard events into text. For most European languages and scripts, this process is fairly simple, amounting at most to inserting an accented vowel, but for East Asian languages (Chinese, Japanese, and Korean, collectively CJK) it is significantly more involved, as it is for various other complex scripts.

It’s complicated in a number of ways. First, the interaction between a given text field and the IME is bidirectional: the IME needs to be able to modify the contents of the text field, but it also needs to be able to query the field’s current contents in order to have the context needed to interpret events. Similarly, you need to notify the IME of changes to the cursor position or selection state, since the same keystroke may produce different output depending on the surrounding text. Second, you also need to keep the IME informed of the on-screen location of the text field, because the IME often displays a “candidate” window listing possible interpretations of the active sequence of keyboard events. Finally (not actually finally; I’ve written three thousand words and I’m not done yet), implementing IME support in a cross-platform way is complicated by differences in the underlying platform APIs: macOS has editable text fields implement a protocol, then lets the field handle accepting and applying changes from the IME, whereas the Windows API uses a lock-and-release mechanism. Designing an abstraction over these two approaches is yet another layer of complexity.
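Sketched as a Rust trait, the two-way contract might look like the following. The names here are invented for illustration; on macOS, the corresponding real-world interface is the NSTextInputClient protocol.

```rust
use std::ops::Range;

/// What an editable text field must expose so the platform IME can
/// drive it. All positions are byte offsets into the field's text.
pub trait TextInputHandler {
    // -- The IME querying us for context: --
    fn text_in_range(&self, range: Range<usize>) -> String;
    fn selection(&self) -> Range<usize>;

    // -- The IME editing the document: --
    /// Replace the in-progress (uncommitted) composition text.
    fn set_composition(&mut self, text: &str, replacing: Range<usize>);
    /// Commit finished text, ending the composition.
    fn commit(&mut self, text: &str);

    // -- Layout information: --
    /// Screen rectangle of a text range, so the IME can position its
    /// candidate window near the composition. Returns (x, y, w, h).
    fn bounding_box(&self, range: Range<usize>) -> (f64, f64, f64, f64);
}
```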

There is one additional complication related to text input: on macOS, you need to support the Cocoa text system, which lets users specify system-wide key bindings that can issue a variety of text editing and navigation commands.

The bottom line: handling input correctly takes a great deal of work, and if you don’t do that work, your framework is basically a toy.

Accessibility

A desktop application framework must support the native accessibility APIs, and ideally in a way that requires no special thought or work from the application developer. Accessibility is an umbrella term for a number of assistive technologies, the most important being support for screen readers and assisted navigation. Screen reader support means interoperating with the platform APIs that describe the structure and content of your application; assisted navigation means providing a way to move linearly between the elements on screen, allowing each element to be highlighted, described, and activated in turn, using the keyboard or a joystick.
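Roughly speaking, what the framework must be able to produce is a tree of semantic nodes shadowing the widget tree, which a platform backend then exposes through UI Automation on Windows, NSAccessibility on macOS, or AT-SPI on Linux. A sketch, with invented names:

```rust
/// The semantic role a screen reader announces for an element.
pub enum Role {
    Window,
    Button,
    CheckBox,
    TextField,
    Slider,
}

/// One node in the accessibility tree; a platform backend translates
/// this into the native accessibility API's objects.
pub struct AccessibilityNode {
    pub role: Role,
    /// The name read aloud, e.g. "Save".
    pub label: String,
    /// Current value, for text fields, sliders, checkboxes, etc.
    pub value: Option<String>,
    /// Whether assisted navigation can land on this node.
    pub focusable: bool,
    pub children: Vec<AccessibilityNode>,
}
```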

In addition to these core features, your framework should respect the user’s system-level preferences for things like text size, reduced color contrast, and reduced animation. Related, though not exactly accessibility: you will also want to support dark mode, and things like user-selected accent colors.

Internationalization and localization

Your framework should support internationalization. The most obvious component of this is the localization of strings, but it also includes things like mirroring the interface in right-to-left locales. Additionally, dates, times, currencies, calendar units, names, collation, and the general formatting of numeric data should all respect the user’s locale. If this isn’t something you’ve thought about before, it is almost certainly more complicated than you think. But don’t worry: there’s a standard. All you have to do is implement it.
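To see how quickly “just print a number” turns into locale work, here is a toy sketch; a real implementation would draw on the Unicode CLDR data (that standard, again) rather than hard-coding a few locales as this one does.

```rust
/// Toy locale-aware number formatting; illustration only. It ignores
/// negative numbers and locales that don't group digits by threes
/// (e.g. en-IN writes 1234567 as "12,34,567").
pub fn format_number(value: f64, locale: &str) -> String {
    // Even the group separator and decimal mark vary by locale:
    // en-US: 1,234.56    de-DE: 1.234,56    fr-FR: 1 234,56
    let (group, decimal) = match locale {
        "de-DE" => ('.', ','),
        "fr-FR" => ('\u{202F}', ','), // narrow no-break space
        _ => (',', '.'),
    };
    let digits = (value.trunc() as u64).to_string();
    let cents = (value.fract() * 100.0).round() as u64;
    let mut out = String::new();
    for (i, ch) in digits.chars().enumerate() {
        // Insert a group separator every three digits, right to left.
        if i > 0 && (digits.len() - i) % 3 == 0 {
            out.push(group);
        }
        out.push(ch);
    }
    format!("{out}{decimal}{cents:02}")
}
```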

Other less common features

Beyond the features common to most desktop environments, there are platform-specific features to consider. Some of these are stylistic, such as APIs for adding transparency or vibrancy effects to parts of a window; others include menu bar extras, taskbar extensions, Quick Look previews, control panel items, and any number of other things. Your framework should at least make these things possible; at a minimum, you should give your users a way to use the platform APIs directly, in case they really need to implement something you didn’t foresee (or haven’t gotten around to).

Putting it all together

This feels like a reasonable place to stop; I have certainly overlooked some things, but I hope I’ve touched on the most important ones. Once you have a sense of everything you need to support and implement, you can start thinking about how to put it all together.

Designing cross-platform APIs

One of the subtler, more interesting challenges in building a GUI framework is API design. Here you have a very particular problem: you are trying to design an API that presents a common interface to a set of fundamentally different underlying platform APIs.

A good example is your application’s menus. As mentioned earlier, Windows and Linux generally expect a menu bar on each of your application’s windows, while macOS has a single menu bar, a component of the desktop environment, which displays your application’s menu while your application is active.

Handling this naively, you might have separate “application menu” and “window menu” APIs, with code paths selected by conditional compilation or runtime checks. But this leads to a lot of duplicated code and is error-prone. In this particular case, I think there is a reasonably clean, fairly simple API that works on all of these platforms: in your framework, treat the menu as a property of the window. On Windows and Linux this is literally the case, so no problem; on macOS, you set the application menu to the menu of the currently active window, swapping it as needed when windows gain or lose active status.
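In code, the shape of this solution is small enough to sketch in full (names hypothetical):

```rust
pub struct Menu {
    // items, submenus, ...
}

/// In the cross-platform API, every window simply owns a menu.
pub struct Window {
    pub menu: Menu,
}

/// Platform hook, called by the backend when a window gains focus.
#[cfg(target_os = "macos")]
pub fn window_became_active(window: &Window) {
    // On macOS the menu bar belongs to the application, so install the
    // newly active window's menu as the application menu.
    set_application_menu(&window.menu);
}

#[cfg(not(target_os = "macos"))]
pub fn window_became_active(_window: &Window) {
    // On Windows and Linux the menu already lives in the window itself,
    // so there is nothing to do on focus changes.
}

#[cfg(target_os = "macos")]
fn set_application_menu(_menu: &Menu) {
    // Would call into Cocoa here (NSApplication's main menu).
}
```

The application developer only ever sees “a window has a menu”; the macOS-specific juggling stays inside the backend.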

This is a fairly clean example, and many other APIs are nowhere near as tidy. In general, designing these cross-platform APIs is a process of carefully reading and experimenting with the platform-specific APIs, and then trying to identify the shared set of features and capabilities that your abstraction can express; where no clear shared feature set exists, it means devising some other API that can at least be implemented in terms of what each platform provides.

The webview

There is a set of major cross-platform GUI frameworks that have successfully dealt with all of this platform complexity, with all its subtle design errors, missing documentation, and mysterious bugs: the major browsers, Chrome, Firefox, and (increasingly) Edge. (Safari doesn’t need to worry about any of this, since it isn’t cross-platform.)

Browsers have had to figure all of this out: subwindows, text input, accessibility, font fallback, the compositor, high-performance painting, drag and drop… it’s all there.

If you want to build something cross-platform, there is a very natural and understandable impulse to reach for web technology: rendering your UI into a native window à la Electron, building an actual web application that runs in the browser, or otherwise leaning on a browser engine. This has obvious drawbacks, particularly in performance (broadly understood, including things like application size and memory consumption) and in “look and feel” (more on that shortly), but it does make life vastly simpler, and the more time I spend working in this space, the more sympathetic I am to choosing the browser.

Native look and feel

One thing that often comes up in discussions of cross-platform GUI work is the collection of concerns I’ll call “native look and feel”. This is a vague term, and I think it helps to split it in two: native behavior and conventions, and native appearance (although the two can overlap).

Native behavior covers many of the things we have already discussed, and a few more besides. Some examples involve scrolling: does your application respect the user’s scrolling preferences? Does scrolling in your application have the same acceleration curve as the platform’s default scroll view? Does your application handle the standard system keyboard shortcuts, such as those for maximizing or hiding a window? Does IME input work? This also extends to other, less obvious conventions: does the application store user data in the conventional location for the current platform? Does it use the system’s open/save dialogs? Does it display the expected menus, containing the expected menu items?

These things matter more on some platforms than on others. On the Mac in particular, it’s important to get these behavioral details right: more than the other platforms, the Mac is designed around specific conventions, which Mac application developers have historically taken pains to follow. That, in turn, has helped create a community of users who value and are sensitive to these conventions, and violating them is sure to annoy that community. On Windows, things are a bit more relaxed: Windows has historically hosted a much wider variety of software, and Microsoft has never been as doctrinaire as Apple about how applications should look and behave.

Native appearance is about the actual look of the application. Do your buttons look like native buttons? Do they have the same dimensions and gradients? More generally, do you use the controls the platform expects for a given interaction, for instance preferring a checkbox on the desktop but a switch on mobile?

This is further complicated by the fact that “native appearance” changes not only from platform to platform but from one OS version to another, to the point that looking “native” on a given machine can require runtime checks of the operating system version.

While all of this is possible, it adds up to a great deal of extra work that is hard to justify for a project with limited labor. For this reason, I’m personally forgiving of projects that don’t try to make pixel-perfect copies of the platform’s built-in widgets, and instead aim for something tasteful and consistent, while giving the framework’s users the tools to restyle things as needed.

In conclusion

I hope this catalog has helped at least loosely define the scope of the problem. Nothing I’ve described here is impossible, but doing all of it, and doing it well, is a very large amount of work.

One final point is worth closing on: for this work to be useful, merely existing isn’t enough. If you want people to use your framework, you have to make it appealing to them: by providing good APIs that are easy to use and idiomatic in the host language, that are well documented, and that let people solve their actual problems.