-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathintro.tex
174 lines (163 loc) · 7.66 KB
/
intro.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
% ------------------------------------------------------------------------------
\section{Introduction}
\label{sec:intro}
% background: conventional grammars:
The task of parsing may be viewed as receiving a document in an
unstructured, serialized form, and trustably building its structured
representation.
%
For formats defined in conventional data-description languages that
correspond closely to well-understood classes of the Chomsky hierarchy
(i.e., context-free grammars, with regular expressions as a critical
special case), arguments of trustworthiness fall out naturally from
the definition of the format itself.
%
This is in part due to the fact that in such formats, the structured
representation of a large segment of a message is defined purely in
terms of the structure of its segments.
% real formats have DOMs, which involve names and references
However, these key properties concerning context-freedom critically do
not hold for many practical formats, which define a \emph{Document
Object Model} (DOM): i.e., the result of parsing a document is a
graph between objects, each of which may contain values bound to a
large set of fields.
%
In such formats, it is infeasible to provide context-free definitions
of well-formedness because data objects that must satisfy critical
relations may not occur contiguously: related objects may not form a
tree-like hierarchy in the input stream.
%
Such formats typically introduce a notion of \emph{naming} or
\emph{reference} by which objects may refer to each other.
%
A critical practical example of this design pattern---and the
motivating example of our work---is the document object model of the
\emph{Portable Document Format (PDF)}~\cite{isotc171sc2wg8ISO32000220202020};
%
PDF data objects include an \emph{object identifier}, by which other
objects may refer to them.
% further complications: people want to do other fancy tricks with
% context tables
In such formats, the structures of references and context take on
central importance.
%
In practice, their structure is quite complex in order to support
practical demands.
%
E.g., PDF's \emph{cross-reference table} enables
%
\textbf{(1)} the interpretation of documents to be strongly mutated by
appending content, via \emph{incremental updates};
%
\textbf{(2)} the reference structure to be compressed, via
standardized but non-trivial algorithms applied to
\emph{cross-reference streams}, potentially combined with conventional
cross-reference tables in \emph{hybrid reference files}; and
%
\textbf{(3)} large documents to be partially processed incrementally,
using separated context, via \emph{linearization}.
%
The parsing of this reference and context information occurs before
DOM creation and DOM object validation, introducing a reliance on the
correctness of all \emph{pre-DOM} processing.
% security consequences:
Difficulties in expressing the structure and semantics of references
have resulted in critical security vulnerabilities.
%
The induced \emph{ambiguities} cause different parties
to assign wildly different semantic interpretations to the same
document.
%
Recently discovered attacks that compromise the integrity and
usability of digital
signatures~\cite{rohlmannBreakingSpecificationPDF2021,
mainkaShadowAttacksHiding2021} use maliciously crafted
cross-reference tables. Our work has identified additional exploits
against digital signatures and PDF file integrity based on our
formalisms of PDF file structure and layout~\cite{cve25641}.
Furthermore, such formats invite document \emph{cavities}---segments
of a document that are not reflected in its semantic
interpretation---may store content that is completely unobservable to
parser clients.
%
Such cavities are a powerful mechanism for creating \emph{polyglot
files} (i.e., files that unexpectedly belong to multiple formats),
which themselves have been used in recent critical system security
exploits~\cite{psychicPaper}.
Thus, even the \emph{pre-DOM} parsing and computation
is surprisingly complex (as described in \Cref{sec:pdf,sec:parsing}).
% potential solutions and why they fail:
Fully defining this pre-DOM computation cannot be done
with context-free grammars and weaker formalisms.
%
In the conventional setting, a parser returns a semantic value that is
then potentially transformed by further computation, which itself be
defined in an attribute grammar or parser client logic.
%
The main limitation of such an approach is that computation on
semantic values must then itself effectively parse unstructured data
after computing partial contextual information;
%
such parsing logic is exactly what should be expressed declaratively
in a grammar and implemented by a generated parser.
% our solution: very careful parser combinators:
This paper explores a more powerful alternative: monadic parsers equipped
with actions for explicitly capturing and setting the parser's input, run
in a well-founded sequence of staged computations over semantic values.
%
Using parser combinators is not a new idea: they are available in the
distributions of modern industrial strength
languages~\cite{leijen2001parsec,couprie2015nom,mundkurResearchReportParsley2020,bratus2017curing,willis2020staged}.
%
This approach is explored within an industrial strength case study:
validating and parsing the reference tables that are needed to create
an unambiguous and trustworthy PDF DOM.
In \Cref{sec:specifying,sec:assessing}, we formalize the pre-DOM processing using
monadic parsers, with Haskell as our ``computable specification language.''
There we show executable code that formalizes the subtle parts of
pre-DOM parsing.
% nikhil's tweak:
However, the specification is not yet in itself a complete
reference implementation as it excludes the primitive
parsers as well as excludes some of the more tedious code.
In \Cref{sec:implementation}, we describe our \emph{implementation} in which we,
% nikhil's tweak:
informed by our specification, separately wrote procedural code and
integrated this with the primitive parsers and other required
components to create a complete (but less manifestly correct)
PDF parser.
Our work is unique in that, to our knowledge, it constitutes
the first attempt to use a declarative approach to formalize a comprehensive
set of features and integrity relationships in PDF pre-DOM processing
that define referential context, specifically cross reference tables,
incremental updates, and cross reference table compression within
cross-reference streams.
These are the first stages in the PDF ``Trust Chain,'' and strongly complements all
efforts that rely on a trustworthy formalization of reference in order
to validate properties of higher-level document abstraction defined in
terms of a DOM.
Our pre-DOM formalization is more subtle than what may be often be
implemented by inspecting the PDF standard or many extant documents:
in the process of producing our definition, we raised several issues
with the current PDF standard which have been acknowledged and
addressed by the PDF Association and the ISO.
%
However, there is nothing in the format definition that requires the
specific language of combinators to be employed: a key goal of our work is
to provide this formalized definition as a worked case study, to be
vetted and improved upon using definitions in other experimental data
definition languages as they are developed.
\paragraph*{Organization}
%
\Cref{sec:pdf} presents PDF, its complexities, and vulnerabilities;
%
\Cref{sec:parsing} discusses the details and surprising
complexities of actually \emph{parsing} the PDF format;
%
\Cref{sec:specifying,sec:assessing} present and analyze our
specification of reference in detail;
%
\Cref{sec:implementation} discusses our implementation;
%
\Cref{sec:rel-work} reviews related work, and %
\Cref{sec:conclusion} concludes.