ietf-openproxy

Re: P: single assignment semantics

2003-10-30 09:33:44

On Wed, 29 Oct 2003, Andre Beck wrote:

I am not ready to cast my vote here. I think we need to consider
more use cases. How, for example, do you propose to handle a case
where a callout service wants to adapt all GIF images, and only
them? We can assume HTTP application protocol for now.

One approach would be to write a P program that forwards only those
HTTP responses to the service that do not have a content type of
"text/*".  This should significantly reduce the number of
unnecessary service invocations.

Text/* files are not that common, relatively speaking. The actual
reduction with the above technique would be less than 20% (by count)
and less than 10% (by volume). Here is a daily stats snapshot taken at
our IRCache proxies (www.ircache.net), for example:

   Top 20 Object Types:

   3173614 61% Image           9501602 kbytes 19%     3065 bytes/obj
   1035836 20% Other          22461506 kbytes 45%    22204 bytes/obj
    413219  8% Query           2264820 kbytes  5%     5612 bytes/obj
    185256  4% HTML            1666142 kbytes  3%     9209 bytes/obj
    184296  4% Directory       1993028 kbytes  4%    11073 bytes/obj
    115932  2% Lookup          1036971 kbytes  2%     9159 bytes/obj
     26990  1% Executable      5469681 kbytes 11%   207519 bytes/obj
     14140  0% Bundle          1735770 kbytes  3%   125702 bytes/obj
     10733  0% Text              68388 kbytes  0%     6524 bytes/obj
     10548  0% SHTML            152145 kbytes  0%    14770 bytes/obj
     10315  0% PDF             1355587 kbytes  3%   134573 bytes/obj
      6374  0% Audio            193083 kbytes  0%    31019 bytes/obj
      5746  0% Movie           1897433 kbytes  4%   338143 bytes/obj
      4407  0% Applet            21822 kbytes  0%     5070 bytes/obj
       981  0% Software          15537 kbytes  0%    16218 bytes/obj
        59  0% PostScript         9698 kbytes  0%   168324 bytes/obj
        14  0% ISMAP                15 kbytes  0%     1108 bytes/obj
         6  0% Drawing             518 kbytes  0%    88512 bytes/obj

But we may be getting too specific here... One could, of course, come
up with a set of nested rules that guess most images correctly based
on just HTTP headers. My concern is that if we simply prohibit
body-related functionality, there will be "too many" corner cases
where that functionality is essential.
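
To make the header-only approach concrete, here is a rough sketch (in
Python rather than P, since P has no settled syntax) of nested rules
that guess whether a response is an image. The function name, the
fallback on the URL suffix, and the list of suffixes are illustrative
assumptions, not anything proposed on this list:

```python
from typing import Mapping


def looks_like_image(headers: Mapping[str, str], url_path: str = "") -> bool:
    """Guess from headers alone whether an HTTP response carries an image."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    content_type = normalized.get("content-type", "")
    # Drop parameters such as "; charset=..." and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()

    if media_type.startswith("image/"):
        return True
    if media_type.startswith("text/"):
        return False
    # Nested fallback (an assumption): when the type is missing or generic,
    # guess from the URL suffix instead.
    if media_type in ("", "application/octet-stream"):
        return url_path.lower().endswith((".gif", ".jpg", ".jpeg", ".png"))
    return False
```

Rules like these never touch the body, which is exactly why they can
misfire when a server labels its content incorrectly.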

For example, should P be able to block viruses that spread via simple
10KB HTTP POST requests? If yes, we give network operators a powerful
tool. If not, we tell network operators to rely on 3rd party modules
being available "soon enough" for each new virus.

But I agree that in this example it may make sense to give a P
program access to the first few bytes of the message body. This
would still be a small object, though. Do you have an example that
would require the handling of large objects?

I do not have a good one. If an example handles something very large,
you would say that a service should be used instead of a module.
For example, why do we think it is appropriate to prohibit support
for this kind of module/function in P:

        if (VirusVendorModule.findVirus(message.body)) {
                ...
        }

I think it is possible to support the above in P, but I am not sure it
is a good idea to do so. More precisely, I do not know where to draw
the line:

        VirusVendorModule.findVirus(message.body)
        VirusVendorModule.virusProbability(message.bodyprefix) > 30%
        ImageVendorModule.isGif(message.body)
        ImageVendorModule.gifProbability(message.bodyprefix) > 30%
        ...

Generally speaking, I would say that P programs should be restricted
to the evaluation of signaling/control and meta data which is all
typically contained in message headers. I also think that it's safe
to assume that signaling/control/meta data is typically smaller than
the data it is associated with.

And yet you did agree that allowing P modules to peek at small parts
of the body might be useful for things like content type determination
or, perhaps, blocking viruses in POST queries.
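
As an illustration of how little of the body such a peek needs, GIF
content can be recognized from its first six bytes, which are always
"GIF87a" or "GIF89a". A minimal sketch in Python, where the function
name and the idea of a fixed prefix limit are assumptions made for
this example only:

```python
# The two signatures defined by the GIF format.
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")

# Hypothetical cap on how many body bytes a module may inspect.
PREFIX_LIMIT = 6


def is_gif_prefix(body_prefix: bytes) -> bool:
    """Return True if the first bytes of a body carry a GIF signature."""
    # Only the first PREFIX_LIMIT bytes are examined, however much is passed.
    return body_prefix[:PREFIX_LIMIT] in GIF_SIGNATURES
```

The check costs the same for a 1KB body and a 10MB body, since it
never reads past the prefix.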

I do share your concerns about the overheads and complexities of body
handling in P. I am trying to find a flexible scheme that will cover
most cases with ease and leave some freedom for accommodating some
corner cases with effort.

One answer could be that something like findVirus(message.body) or
isGif(message.body) MUST be implemented as a service but may have a
function-call interface to P. We already have to deal with service
execution (something that is not well documented in P or IRML yet) so
we would not be adding much more complexity; we would just need to
document an interface where a service can return a value for the P
interpreter to use.

Instead of writing a bunch of sequential calls to services that would
ignore most of the content:

        apply(service1);
        apply(service2);
        apply(service3);

the programmer would be able to optimize:

        if (apply(service1) returns FooBar) {
                apply(service2);
        } else {
                apply(service3);
        }

The latter is then no different (on a conceptual level) from

        if (service1() == FooBar) {
                service2();
        } else {
                service3();
        }

Do you see what I am getting at?
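
The same idea can be sketched in Python, purely as an illustration.
Here a service is modeled as a function that takes the message and
returns a verdict string; that model, the names run_rules, classify,
and noop, and the use of a GIF check as the first service are all
assumptions for this example, not a documented P or IRML interface:

```python
from typing import Callable, List

# Hypothetical model: a service takes the message and returns a verdict
# string that the rule interpreter can branch on.
Service = Callable[[bytes], str]


def run_rules(message: bytes, service1: Service, service2: Service,
              service3: Service) -> List[str]:
    """Apply service1, then pick service2 or service3 based on its verdict."""
    invoked = ["service1"]
    if service1(message) == "FooBar":
        service2(message)
        invoked.append("service2")
    else:
        service3(message)
        invoked.append("service3")
    return invoked


def classify(message: bytes) -> str:
    # Stand-in for service1: the verdict is "FooBar" for GIF-looking bodies.
    return "FooBar" if message.startswith(b"GIF8") else "Other"


def noop(message: bytes) -> str:
    # Stand-in for service2 and service3, which adapt nothing here.
    return "done"
```

Each message reaches two services instead of three, and the rule
interpreter itself never looks at the body.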

In practice, I think we should look at those message properties that
OPES processors need to inspect for their normal operation anyway.
For example, do Web caches typically look at HTTP bodies in order to
decide whether or not to cache an object? If not, then maybe this
would be a message property that should rather be inspected by an
OPES service application running on a dedicated and specialized
callout server.

"Straight" web caches look at bodies only if they need to handle
transfer encodings and such. However, there are some HTTP proxies
that adapt bodies. We can say that those proxies are out of scope;
they should be using services instead.

I think whenever the evaluation of a property is likely to be
expensive and could therefore interfere with the normal operation of
an OPES processor, then we should disallow this operation in P.

If we try to do something like that, we are still faced with the same
specification problem: How do you define "likely to be expensive"? Is
it expensive for 10% of use cases? 90%?

Is looking at the first 100 bytes of a body expensive? Is looking at
all already available bytes of a body expensive? What if the entire
10MB body is available because it is cached or pre-positioned on the
proxy?

Alex.




