header {* Regular Expressions *}

(*<*)theory Exercises4 imports Main begin(*>*)

text {*
This assignment \emph{will be assessed} to determine 50\% of your
final mark. Please complete the indicated tasks and write a brief
document explaining your work. You may prepare this document using
Isabelle's theory presentation facility, but this is not required. (A
very simple way to print a theory file legibly is to use the Proof
General command Isabelle~$>$ Commands~$>$ Display draft. You can
combine the resulting output with a document produced using your
favourite word processing package.) A clear write-up describing
elegant, clearly structured proofs of all tasks will receive maximum
credit.

You must work on this assignment as an individual.  Collaboration is
not permitted.
*}

text {*
Consider reading, e.g.,
\url{http://en.wikipedia.org/wiki/Regular_expression} to refresh your
knowledge of regular expressions.

\smallskip

For this assignment, we define \emph{regular expressions} (over an
arbitrary type @{typ 'a} of characters) as follows:
\begin{enumerate}
\item $\emptyset$ is a regular expression.
\item $\varepsilon$ is a regular expression.
\item If $c$ is of type @{typ 'a}, then $\mathit{Atom}(c)$ is a
regular expression.
\item If $x$ and $y$ are regular expressions, then $xy$ is a regular
expression.
\item If $x$ and $y$ are regular expressions, then $x+y$ is a regular
expression.
\item If $x$ is a regular expression, then $x^*$ is a regular
expression.
\end{enumerate}
Nothing else is a regular expression.

\smallskip

$\rhd$ Define a corresponding Isabelle/HOL data type.  (Your concrete
syntax may be different from that used above.  For instance, you could
write @{term "Star x"} for $x^*$.)
*}

(*<*)typedecl 'a regexp(*>*)

subsection {* Regular Languages *}

text {* A \emph{word} is a list of characters: *}
type_synonym 'a word = "'a list"

text {*
Regular expressions denote formal languages, i.e., sets of words.  For
$x$ a regular expression, we define its \emph{language} $L(x)$ as
follows:
\begin{enumerate}
\item $L(\emptyset) = \emptyset$.
\item $L(\varepsilon) = \{[\,]\}$.
\item $L(\mathit{Atom}(c)) = \{[c]\}$.
\item $L(xy) = \{ uv \mid u \in L(x) \wedge v \in L(y) \}$.
\item $L(x+y) = L(x) \cup L(y)$.
\item $L( x^* )$ is the smallest set that contains the empty word and
is closed under concatenation with words in $L(x)$.  That is,
(i)~$[\,] \in L( x^* )$, and (ii)~if $u \in L(x)$ and $v \in L( x^*
)$, then $uv \in L( x^* )$.
\end{enumerate}

$\rhd$ Define a function @{term L} that maps regular expressions to
their language.
*}

(*<*)consts(*>*)    L :: "'a regexp \<Rightarrow> 'a word set"

text {*
$\rhd$ Prove the following lemma.
*}

lemma "L (Star (Star x)) = L (Star x)"
(*<*)oops(*>*)

subsection {* Matching via Derivatives *}

text {*
We now consider regular expression \emph{matching}: the problem of
determining whether a given word is in the language of a given regular
expression.  You are about to develop your own verified regular
expression matcher.  We need some auxiliary notions first.

A regular expression is called \emph{nullable} iff its language
contains the empty word.

$\rhd$ Define a recursive function @{term "nullable x"} that computes
(by recursion over @{term x}, i.e., without explicit use of @{term L})
whether a regular expression is nullable.
*}

(*<*)consts(*>*)    nullable :: "'a regexp \<Rightarrow> bool"

text {*
$\rhd$ Prove the following lemma.
*}

lemma(*<*)nullable_correct:(*>*) "nullable x = ([] \<in> L x)"
(*<*)oops(*>*)

text {*
\pagebreak

The \emph{derivative} of a language~$\mathcal{L}$ with respect to a
word~$u$ is defined to be $\delta_u\,\mathcal{L} = \{ v \mid uv \in
\mathcal{L} \}$.

For languages that are given by regular expressions, there is a
natural algorithm to compute the derivative as another regular
expression.

$\rhd$ Define a recursive function @{term "\<Delta> c x"} that computes (by
recursion over @{term x}) a regular expression whose language is the
derivative of @{term "L x"} with respect to the single-character word
@{term "[c]"}.
*}

(*<*)consts(*>*)    \<Delta> :: "'a \<Rightarrow> 'a regexp \<Rightarrow> 'a regexp"

text {*
Hint: @{term nullable} might come in handy.

\smallskip

$\rhd$ Prove the following lemma.
*}

lemma(*<*)Delta_correct:(*>*) "u \<in> L (\<Delta> c x) = (c#u \<in> L x)"
(*<*)oops(*>*)

text {*
Hint: see the \emph{Tutorial on Isabelle/HOL} and the \emph{Tutorial
on Isar} for advanced induction techniques.

\smallskip

$\rhd$ Define a recursive function @{term \<delta>} that lifts @{term \<Delta>} from
single characters to words, i.e., @{term "\<delta> u x"} is a regular
expression whose language is the derivative of @{term "L x"} with
respect to the word @{term u}.
*}

(*<*)consts(*>*)    \<delta> :: "'a word \<Rightarrow> 'a regexp \<Rightarrow> 'a regexp"

text {*
$\rhd$ Prove the following lemma.
*}

lemma(*<*)delta_correct:(*>*) "u \<in> L (\<delta> v x) = (v @ u \<in> L x)"
(*<*)oops(*>*)

text {*
To obtain a regular expression matcher, we finally observe that @{term
"u \<in> L x"} if and only if @{term "\<delta> u x"} is nullable.
*}

definition match :: "'a word \<Rightarrow> 'a regexp \<Rightarrow> bool" where
  "match u x = nullable (\<delta> u x)"

text {*
$\rhd$ Prove correctness of @{term match}.
*}

theorem "match u x = (u \<in> L x)"
(*<*)oops(*>*)

(*<*)end(*>*)