Multimedia Protocols

Up to this point, we've been discussing methods of exchanging real-time messages in text. There are also real-time messaging systems that allow the exchange of other kinds of data; these include Internet telephones, video conferencing systems, and application-sharing systems. These types of data require a great deal more bandwidth than plain text and often have more security implications.

Multimedia protocols tend to have several common characteristics. First, they normally use more than one port. They use multiple data streams in order to separate data with different characteristics and in order to maximize the efficiency with which they use network resources. Thus, they normally separate audio data from video data and use different channels for data going in different directions. They also separate the actual data from administrative commands, so that the port used to send video is not the same as the port used to say "Stop sending me video, I can't take it any more"; this maximizes the chances that the administrative commands will actually get through. The administrative functions are normally known as call control.

Most multimedia protocols use different lower-level protocols for data and for call control. Data is almost always sent over UDP, while call control is almost always sent over TCP. This is because the data needs to arrive as quickly as possible; it's not important if some packets are lost, as long as all the packets that get through are used as soon as they arrive. The call control, on the other hand, happens less often but must not get lost; it's worth the higher overhead of TCP in order to be guaranteed that commands will arrive.
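The "use whatever arrives, never wait for what didn't" behavior of the data channel can be sketched roughly as follows. This is only an illustration: the function name, the (sequence number, payload) tuples, and the sample arrival order are all hypothetical, and real products add buffering and timing that are not modeled here.

```python
from typing import Iterable, List, Tuple

def play_as_received(datagrams: Iterable[Tuple[int, str]]) -> Tuple[List[str], int]:
    """Use each media datagram as soon as it arrives; never wait for a lost one.

    `datagrams` is a hypothetical arrival sequence of (sequence_number, payload).
    Returns the payloads that were played and the number of sequence numbers
    that were skipped over.
    """
    played: List[str] = []
    skipped = 0
    next_expected = 0
    for seq, payload in datagrams:
        if seq < next_expected:
            continue  # arrived too late to be useful, so it is simply dropped
        skipped += seq - next_expected  # gaps are accepted, not retransmitted
        played.append(payload)
        next_expected = seq + 1
    return played, skipped

# Datagram 2 is overtaken by datagram 3 and shows up too late to be played.
arrivals = [(0, "a0"), (1, "a1"), (3, "a3"), (2, "a2"), (4, "a4")]
played, skipped = play_as_received(arrivals)
print(played, skipped)
```

A call control command cannot be treated this way, because a skipped "stop sending" command is a failure rather than a brief glitch; that is the case for paying TCP's overhead on the control channel.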

Multimedia protocols are very difficult to protect adequately with firewalls. It would be hard to support any protocol that involved a large number of channels, going in both directions, and using both connection-oriented and connectionless protocols, but multimedia protocols further complicate the picture by requiring very high performance.

T.120 and H.323[30] are International Telecommunication Union (ITU) standards for conferencing. T.120 covers file transfer, chat, whiteboard, and application sharing; H.323 covers audio and video conferencing. These are both higher-level standards that use a number of lower-level protocols for various purposes, and you will occasionally hear people talk about Q.931, G.711, H.245, H.261, and H.263 in particular as parts of H.323, and T.122 through T.127 as parts of T.120. For most purposes, you don't need to worry about these lower-level protocols; they are used only in conjunction with the higher-level standards.

Neither the H.323 nor the T.120 standard requires implementors to provide any security. H.323 is used to carry audio and video data that will be presented to the user. Although this presents a risk of information leaks, it's not directly dangerous to the client except in the ways all protocols are dangerous to clients. Because H.323 sets up a large number of incoming data channels, both UDP and TCP, there's a significant risk that allowing H.323 will allow people to attack other, more vulnerable services.

T.120, on the other hand, is inherently dangerous. Both file transfer and application sharing are directly attackable applications.

H.323 uses at least three ports per connection. A TCP connection at port 1720 is used for call setup. In addition, each data stream requires one dynamically allocated TCP port (for call control) and one dynamically allocated UDP port (for data). Audio and video are sent separately, and data streams are one-way; this means that a normal video conference will require no fewer than eight dynamically allocated ports (a TCP control port and a UDP data port for outgoing video, another pair for outgoing audio, another pair for incoming video, and a final pair for incoming audio). Figure 19.3 shows the connections involved in a generic H.323 conference. Note that four of the dynamically allocated ports will be established from the outside to the inside (regardless of which side initiated the conversation).
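The port arithmetic above can be checked with a short sketch. The `Channel` class and `dynamic_channels` function are hypothetical names used only for this illustration; the sketch counts channels as the text describes them and does not model the protocol itself.

```python
from dataclasses import dataclass
from typing import List

CALL_SETUP_PORT = 1720  # fixed TCP port used for call setup, per the text

@dataclass(frozen=True)
class Channel:
    media: str      # "audio" or "video"
    direction: str  # "outgoing" or "incoming"
    protocol: str   # "TCP" (call control) or "UDP" (data)

def dynamic_channels(media_types: List[str]) -> List[Channel]:
    """List the dynamically allocated ports one conference needs.

    Each one-way stream gets a TCP control port and a UDP data port.
    """
    channels = []
    for direction in ("outgoing", "incoming"):
        for media in media_types:
            channels.append(Channel(media, direction, "TCP"))
            channels.append(Channel(media, direction, "UDP"))
    return channels

conference = dynamic_channels(["video", "audio"])
inbound = [c for c in conference if c.direction == "incoming"]
print(len(conference), "dynamic ports,", len(inbound), "opened from the outside")
```

For a normal video conference this gives eight dynamic ports in addition to the call setup connection on port 1720, four of which are opened from the outside to the inside.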

The extensive use of dynamically allocated ports makes H.323 very hard to deal with via packet filtering; in fact, Microsoft's instructions for NetMeeting (which is based on H.323 and is discussed later) suggest allowing all UDP and TCP connections in either direction where both ends are above 1024. This configuration is extremely insecure, and we don't recommend it. However, it is the only way to allow H.323 through a nonstateful packet filtering firewall.
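To see why this rule is so permissive, it helps to write it down as a predicate. The function below is a hypothetical model of the rule, not the syntax of any real packet filter, and the two service ports (X11 on 6000 and NFS on 2049) are used only as familiar examples of services that listen above 1024.

```python
def nonstateful_rule_allows(protocol: str, src_port: int, dst_port: int) -> bool:
    """Model of the 'both ends above 1024' rule; direction is deliberately ignored."""
    return protocol in ("TCP", "UDP") and src_port > 1024 and dst_port > 1024

# Traffic that has nothing to do with conferencing but still matches the rule.
# The source port 40000 is an arbitrary high port an outside client might use.
unrelated_inbound = {
    "X11 display (TCP 6000)": ("TCP", 40000, 6000),
    "NFS (UDP 2049)": ("UDP", 40000, 2049),
}
for name, packet in unrelated_inbound.items():
    print(name, "allowed:", nonstateful_rule_allows(*packet))

# Services on low ports are still blocked, which is all the rule protects.
print("Telnet (TCP 23) allowed:", nonstateful_rule_allows("TCP", 40000, 23))
```

Anything an internal machine runs on a high-numbered port becomes reachable from outside, whether or not a conference is in progress.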

A stateful packet filter that can monitor the H.323 port negotiation would be capable of allowing only the needed data ports. Note that straightforward tricks like allowing only UDP responses will not work for H.323 because the incoming data streams from the remote host will not meet the normal criteria to be considered a response; the packet filtering must be H.323-aware. Unfortunately, H.323 is not particularly easy to parse, so H.323-aware packet filters are rare, although high-end packet filtering systems do offer them.
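The bookkeeping such a filter needs can be sketched as follows. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, the addresses are documentation examples, and the hard part (parsing the negotiation out of the call control traffic) is assumed to have been done already by whatever calls `saw_negotiation`.

```python
from typing import Set, Tuple

# (remote host, internal host, protocol, internal port)
Flow = Tuple[str, str, str, int]

class H323AwareFilter:
    """Permit only the inbound channels that were negotiated for a live call."""

    def __init__(self) -> None:
        self._negotiated: Set[Flow] = set()

    def saw_negotiation(self, remote: str, internal: str,
                        protocol: str, port: int) -> None:
        # Called when the (hypothetical) parser sees a port being negotiated.
        self._negotiated.add((remote, internal, protocol, port))

    def call_ended(self, remote: str, internal: str) -> None:
        # Forget every port that belonged to the finished call.
        self._negotiated = {
            flow for flow in self._negotiated
            if (flow[0], flow[1]) != (remote, internal)
        }

    def allows_inbound(self, remote: str, internal: str,
                       protocol: str, port: int) -> bool:
        return (remote, internal, protocol, port) in self._negotiated

pf = H323AwareFilter()
pf.saw_negotiation("198.51.100.7", "192.0.2.10", "UDP", 49170)
print(pf.allows_inbound("198.51.100.7", "192.0.2.10", "UDP", 49170))  # negotiated
print(pf.allows_inbound("198.51.100.7", "192.0.2.10", "UDP", 6000))   # never negotiated
pf.call_ended("198.51.100.7", "192.0.2.10")
print(pf.allows_inbound("198.51.100.7", "192.0.2.10", "UDP", 49170))  # call is over
```

Unlike a "UDP responses only" rule, this admits an incoming stream that no internal host has sent a packet to first, but only on a port that was actually negotiated and only from the host that negotiated it.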

Because H.323 does not have any built-in authentication, allowing H.323 through a packet filter is not very secure, even if you use a dynamic packet filtering system that understands H.323. If you are concerned about transmitting confidential data, or about the security of your clients, you would be better off using a proxy that provides authentication features.

H.323 has almost every characteristic that makes a protocol hard to proxy; it uses both TCP and UDP, it uses multiple ports, it uses dynamically allocated ports, it creates connections in both directions, and it embeds address information inside packets. The only good news is that the protocol provides a space where clients can specify a desired destination, making it easy for a proxy to figure out where connections should be directed.

One way of getting around the problems with proxying H.323 is to use what the standard calls a Multipoint Control Unit (MCU) and place it in a publicly accessible part of your network. These systems are designed primarily to control many-to-many connections, but they do it by having each person in the conference connect to them. This means that if you put one on a bastion-host network, you can allow both internal and external callers to connect to it, and only to it, and still get conferencing going. If this machine is well configured, it is relatively safe. However, it's not a true proxy. The external users have to be able to connect directly to the MCU, and one MCU will not connect to another. The end result is that two sites that both use this workaround can't talk to each other; it works only if exactly one site in the conversation uses it. Several systems are available that provide this functionality, under various names.

It is also possible to get true H.323 proxies, which usually provide multipoint control and security features as well. In general, these are special-purpose products, not included with generic proxying packages. As we've pointed out, proxying H.323 is considerable work; it's not a minor modification to a normal proxy. However, vendors like Cisco and Microsoft that offer wide product ranges do offer H.323 proxying as part of specialized video conferencing products.

RTP is an IETF standard for transmitting real-time data (notably, audio and video). The most common use of RTP is as a lower-level protocol in conjunction with H.323. The standard for RTP details a pair of protocols: RTP transfers data, and RTCP is the control protocol. Some products that talk about RTP mean RTP in conjunction with RTCP, while others truly mean that they use RTP only, using some other protocol for control.
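For readers who want to recognize RTP data packets on the wire, here is a minimal sketch of decoding the 12-byte fixed header that begins every RTP packet, following the layout in RFC 3550. The `RtpHeader` and `parse_rtp_header` names and the sample packet are made up for illustration; CSRC lists, header extensions, and RTCP are not handled.

```python
import struct
from typing import NamedTuple

class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int

def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed RTP header (network byte order)."""
    if len(packet) < 12:
        raise ValueError("an RTP packet has at least a 12-byte fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,             # top two bits; 2 for current RTP
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,      # 0 is G.711 mu-law audio (PCMU)
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )

# A made-up packet: version 2, payload type 0, sequence 1, timestamp 160,
# followed by four bytes standing in for audio data.
sample = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234ABCD) + b"\x00" * 4
header = parse_rtp_header(sample)
print(header)
```

The sequence number and timestamp are what let a receiver use packets as they arrive and notice gaps without asking for retransmission.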



[30] In case you're curious, the letters "T" and "H" are the designators for the ITU subcommittees that produced the standard, and subcommittee designators are just given out in alphabetical order. They're not short for anything.