James F. Allen (computer scientist)

James F. Allen (computer scientist)

James Frederick Allen (born 1950) is an American computational linguist recognized for his contributions to temporal logic, in particular Allen's interval algebra. He is interested in knowledge representation, commonsense reasoning, and natural language understanding, believing that "deep language understanding can only currently be achieved by significant hand-engineering of semantically-rich formalisms coupled with statistical preferences". He is the John H. Dessaurer Professor of Computer Science at the University of Rochester. == Biography == Allen received his Ph.D. from the University of Toronto in 1979, under the supervision of C. Raymond Perrault, after which he joined the faculty at Rochester. At Rochester, he was department chair from 1987 to 1990, directed the Cognitive Science Program from 1992 to 1996, and co-directed the Center for the Sciences of Language from 1996 to 1998. He served as the Editor-in-Chief of Computational Linguistics from 1983–1993. Since 2006 he has also been associate director of the Florida Institute for Human and Machine Cognition. == Academic life == === TRIPS project === The TRIPS project is a long-term research to build generic technology for dialogue (both spoken and 'chat') systems, which includes natural language processing, collaborative problem solving, and dynamic context-sensitive language modeling. This is contrast with the data driven approaches by machine learning, which requires to collect and annotate corpora, i.e. training data, firstly. === PLOW agent === PLOW agent is a system that learns executable task models from a single collaborative learning session, which integrates wide AI technologies including deep natural language understanding, knowledge representation and reasoning, dialogue systems, planning/agent-based systems, and machine learning. This paper won the outstanding paper award at AAAI in 2007. == Selected works == === Books === Allen is the author of the textbook Natural Language Understanding (Benjamin-Cummings, 1987; 2nd ed., 1995). He is also the co-author with Henry Kautz, Richard Pelavin, and Josh Tenenberg of Reasoning About Plans (Morgan Kaufmann, 1991). === Articles === 2007. PLOW: A Collaborative Task Learning Agent. (with Nathanael Chambers et al) AAAI'07 won the outstanding paper award at AAAI in 2007. 2006. Chester: Towards a Personal Medication Advisor. (with N. Blaylock, et al) Biomedical informatics 39(5) 1998. TRIPS: An Integrated Intelligent Problem-Solving Assistant. (with George Ferguson) AAAI'98 1983. Maintaining Knowledge about Temporal Intervals. CACM 26, 11, 832-843 == Awards and honors == In 1991 he was elected as a fellow of the Association for the Advancement of Artificial Intelligence (1990, founding fellow). In 1992 he became the Dessaurer Professor at Rochester.

Hybrid machine translation

Hybrid machine translation is a method of machine translation that is characterized by the use of multiple machine translation approaches within a single machine translation system. The motivation for developing hybrid machine translation systems stems from the failure of any single technique to achieve a satisfactory level of accuracy. Many hybrid machine translation systems have been successful in improving the accuracy of the translations, and there are several popular machine translation systems which employ hybrid methods. == Approaches == === Multi-engine === This approach to hybrid machine translation involves running multiple machine translation systems in parallel. The final output is generated by combining the output of all the sub-systems. Most commonly, these systems use statistical and rule-based translation subsystems, but other combinations have been explored. For example, researchers at Carnegie Mellon University have had some success combining example-based, transfer-based, knowledge-based and statistical translation sub-systems into one machine translation system. === Statistical rule generation === This approach involves using statistical data to generate lexical and syntactic rules. The input is then processed with these rules as if it were a rule-based translator. This approach attempts to avoid the difficult and time-consuming task of creating a set of comprehensive, fine-grained linguistic rules by extracting those rules from the training corpus. This approach still suffers from many problems of normal statistical machine translation, namely that the accuracy of the translation will depend heavily on the similarity of the input text to the text of the training corpus. As a result, this technique has had the most success in domain-specific applications, and has the same difficulties with domain adaptation as many statistical machine translation systems. === Multi-Pass === This approach involves serially processing the input multiple times. The most common technique used in multi-pass machine translation systems is to pre-process the input with a rule-based machine translation system. The output of the rule-based pre-processor is passed to a statistical machine translation system, which produces the final output. This technique is used to limit the amount of information a statistical system need consider, significantly reducing the processing power required. It also removes the need for the rule-based system to be a complete translation system for the language, significantly reducing the amount of human effort and labor necessary to build the system. === Confidence-Based === This approach differs from the other hybrid approaches in that in most cases only one translation technology is used. A confidence metric is produced for each translated sentence from which a decision can be made whether to try a secondary translation technology or to proceed with the initial translation output. SMT is also used when common error patterns such as multiple repeat words appear in sequence, as is common with NMT when the attention mechanism is confused.

Machine-readable medium and data

In communications and computing, a machine-readable medium (or computer-readable medium) is a medium capable of storing data in a format easily readable by a digital computer or a sensor. It contrasts with human-readable medium and data. The result is called machine-readable data or computer-readable data, and the data itself can be described as having machine-readability. == Data == Machine-readable data must be structured data. Attempts to create machine-readable data occurred as early as the 1960s. At the same time that seminal developments in machine-reading and natural-language processing were releasing (like Weizenbaum's ELIZA), people were anticipating the success of machine-readable functionality and attempting to create machine-readable documents. One such example was musicologist Nancy B. Reich's creation of a machine-readable catalog of composer William Jay Sydeman's works in 1966. In the United States, the OPEN Government Data Act of 14 January 2019 defines machine-readable data as "data in a format that can be easily processed by a computer without human intervention while ensuring no semantic meaning is lost." The law directs U.S. federal agencies to publish public data in such a manner, ensuring that "any public data asset of the agency is machine-readable". Machine-readable data may be classified into two groups: human-readable data that is marked up so that it can also be read by machines (e.g. microformats, RDFa, HTML), and data file formats intended principally for processing by machines (CSV, RDF, XML, JSON). These formats are only machine readable if the data contained within them is formally structured; exporting a CSV file from a badly structured spreadsheet does not meet the definition. Machine readable is not synonymous with digitally accessible. A digitally accessible document may be online, making it easier for humans to access via computers, but its content is much harder to extract, transform, and process via computer programming logic if it is not machine-readable. Extensible Markup Language (XML) is designed to be both human- and machine-readable, and Extensible Stylesheet Language Transformations (XSLT) is used to improve the presentation of the data for human readability. For example, XSLT can be used to automatically render XML in Portable Document Format (PDF). Machine-readable data can be automatically transformed for human-readability but, generally speaking, the reverse is not true. For purposes of implementation of the Government Performance and Results Act (GPRA) Modernization Act, the Office of Management and Budget (OMB) defines "machine readable format" as follows: "Format in a standard computer language (not English text) that can be read automatically by a web browser or computer system. (e.g.; xml). Traditional word processing documents and portable document format (PDF) files are easily read by humans but typically are difficult for machines to interpret. Other formats such as extensible markup language (XML), (JSON), or spreadsheets with header columns that can be exported as comma separated values (CSV) are machine readable formats. As HTML is a structural markup language, discreetly labeling parts of the document, computers are able to gather document components to assemble tables of contents, outlines, literature search bibliographies, etc. It is possible to make traditional word processing documents and other formats machine readable but the documents must include enhanced structural elements." == Media == Examples of machine-readable media include magnetic media such as magnetic disks, cards, tapes, and drums, punched cards and paper tapes, optical discs, barcodes and magnetic ink characters. Common machine-readable technologies include magnetic recording, processing waveforms, and barcodes. Optical character recognition (OCR) can be used to enable machines to read information available to humans. Any information retrievable by any form of energy can be machine-readable. Examples include: Acoustics Chemical Photochemical Electrical Semiconductor used in volatile RAM microchips Floating-gate transistor used in non-volatile memory cards Radio transmission Magnetic storage Mechanical Tins And Swins Punched card Paper tape Music roll Music box cylinder or disk Grooves (See also: Audio Data) Phonograph cylinder Gramophone record DictaBelt (groove on plastic belt) Capacitance Electronic Disc Optics Optical storage Thermodynamic == Applications == === Documents === === Catalogs === === Dictionaries === === Passports ===

Lillian Lee (computer scientist)

Lillian Lee is a computer scientist whose research involves natural language processing, sentiment analysis, and computational social science. She is a professor of computer science and information science at Cornell University, and co-editor-in-chief of the journal Transactions of the Association for Computational Linguistics. == Education == Lee graduated from Cornell University in 1993 with an undergraduate degree in math and science. She completed her Ph.D. at Harvard University in 1997. Her dissertation, Similarity-Based Approaches to Natural Language Processing, was supervised by Stuart M. Shieber. == Career == Lee has been a member of the Cornell faculty since 1997. == Recognition == Lee has been a fellow of the Association for the Advancement of Artificial Intelligence since 2013, and of the Association for Computational Linguistics since 2017. Lee was elected as an ACM Fellow in 2018 for "contributions to natural language processing, sentiment analysis, and computational social science".

Julie Beth Lovins

Julie Beth Lovins (October 19, 1945, in Washington, D.C. – January 26, 2018, in Mountain View, California) was a computational linguist who published The Lovins Stemming Algorithm - a type of stemming algorithm for word matching - in 1968. The Lovins Stemmer is a single pass, context sensitive stemmer, which removes endings based on the longest-match principle. The stemmer was the first to be published and was extremely well developed considering the date of its release, having been the main influence on a large amount of the future work in the area. -Adam G., et al == Background == Born on October 19, 1945, in Washington, D.C., Lovins grew up in Amherst, Massachusetts. Her father Gerald H. Lovins was an engineer and her mother, Miriam Lovins, a social services administrator. Lovins' brother Amory Lovins is the co-founder and chief environmental scientist of Rocky Mountain Institute. For her undergraduate degree, Lovins attended Pembroke College, the women's college of Brown University, which later combined into Brown University in 1971. At Pembroke College, Lovins studied mathematics and linguistics, graduating with honors. Her thesis was named, A Study of Idioms. She received the inaugural Bloch Fellowship in 1970 from the Linguistic Society of America to attend graduate school. Lovins obtained her Master of Arts in 1970 and Doctor of Philosophy in 1973 from the University of Chicago, studying linguistics. At the University of Chicago, her dissertation was titled, Loan Phonology -- Subject Matter. A revision of her thesis on loanwords and the phonological structure of Japanese was published in 1975 by the Indiana University Linguistics Club. == Teaching career == Following Lovins' PhD, she spent a year working as a linguist-at-large at a University of Tokyo language research institute and as an English conversation teacher. She then joined the faculty at Tsuda College as a professor of English and linguistics, where she taught for seven years. During her time as a faculty member at Tsuda College, Lovins also served as a guest researcher in the University of Tokyo's Research Institute of Logopedics and Phoniatrics, a research center for speech science. == Industry career == After teaching Japanese phonology at Japanese universities abroad, Lovins moved back to the U.S. to work in the computing industry. She worked on early speech synthesis at Bell Labs in Murray Hill, New Jersey. At Bell Labs, Lovins worked with Osamu Fujimura, a Japanese linguist who is credited as a pioneer in speech sciences. Lovins also worked as a software engineer at various companies in Silicon Valley and served as a consultant for computational linguistics throughout the 1990s. As a consultant, she called her business, "The Language Doctor." == The Lovins Stemming Algorithm == Lovins published an article about her work on developing a stemming algorithm through the Research Laboratory of Electronics at MIT in 1968. Lovins' stemming algorithm is frequently referred to as the Lovins stemmer. A stemming algorithm is the process of taking a word with suffixes and reducing it to its root, or base word. Stemming algorithms are used to improve the accuracy in information retrieval and in domain analysis. These algorithms help find variants of the terms being queried. Stemming algorithms bring value in their reduction of a given query into its less complex form, allowing more similar documents to be retrieved for similar queries. Stemming algorithms are prevalent in search engines, such as Google Search, which did not implement word stemming until 2003. This means that up until 2003, a Google search for the word warm would not have explicitly returned results for related words like warmth or warming. As the first published stemming algorithm, Lovins' work set a precedent and influenced future work in stemming algorithms, such as the Porter Stemmer published by Martin Porter in 1980 which has been recognized widely as the most common stemming algorithm for stemming English. Additionally, the Dawson Stemmer developed by John Dawson is an extension of the Lovins stemmer. The Lovins stemmer follows a rule-based affix elimination approach. It first removes the longest identifiable suffix from the target word - producing a base stem word - then indexes a lookup table to convert the (potentially malformed) stem word to a valid word. This process can be split into two phases. In the first phase, a word is compared with a pre-determined list of endings, and when a word is found to contain one of these endings, the ending is removed, leaving only the stem of the word. The second phase standardizes spelling exceptions that come from the first phase, ensuring that words with only marginally varying stems are appropriately paired together. For example, with the word dried, phase one results in dri, which should match with the word dry. The second phase takes care of these exceptions. Compared to other stemmers, Lovins' algorithm is fast and equipped to handle irregular plural words like person and people. Disadvantages, however, include many suffixes not being available in the table of endings. Furthermore, it is sometimes highly unreliable and frequently fails to form valid words from the stems or to match the stems of like-meaning words. This is most often caused by the usage of specialist terminology and domain-specific vocabulary by the author. == Personal life == Lovins moved to Mountain View, California, in 1979, and later to Old Mountain View in 1981 with her partner and later husband Greg Fowler, a software engineer and advocate for environmental issues & the blind. In their free time, she and her husband enjoyed taking walks and volunteering for their local community. Lovins actively volunteered for organizations like the Old Mountain View Neighborhood Association, Mountain View Friends of the Library, League of Women Voters, Mountain View Cool Cities Team, and the Mountain View Sustainability Task Force. In 2016, Lovins' husband died unexpectedly, following a heart attack. Eighteen days after her husband died, Lovins was diagnosed with brain cancer. She died on January 26, 2018, at a hospice, surrounded by friends, family and caregivers.

IT baseline protection

The IT baseline protection (German: IT-Grundschutz) approach from the German Federal Office for Information Security (BSI) is a methodology to identify and implement computer security measures in an organization. The aim is the achievement of an adequate and appropriate level of security for IT systems. To reach this goal the BSI recommends "well-proven technical, organizational, personnel, and infrastructural safeguards". Organizations and federal agencies show their systematic approach to secure their IT systems (e.g. Information Security Management System) by obtaining an ISO/IEC 27001 Certificate on the basis of IT-Grundschutz. == Overview baseline security == The term baseline security signifies standard security measures for typical IT systems. It is used in various contexts with somewhat different meanings. For example: Microsoft Baseline Security Analyzer: Software tool focused on Microsoft operating system and services security Cisco security baseline: Vendor recommendation focused on network and network device security controls Nortel baseline security: Set of requirements and best practices with a focus on network operators ISO/IEC 13335-3 defines a baseline approach to risk management. This standard has been replaced by ISO/IEC 27005, but the baseline approach was not taken over yet into the 2700x series. There are numerous internal baseline security policies for organizations, The German BSI has a comprehensive baseline security standard, that is compliant with the ISO/IEC 27000-series == BSI IT baseline protection == The foundation of an IT baseline protection concept is initially not a detailed risk analysis. It proceeds from overall hazards. Consequently, sophisticated classification according to damage extent and probability of occurrence is ignored. Three protection needs categories are established. With their help, the protection needs of the object under investigation can be determined. Based on these, appropriate personnel, technical, organizational and infrastructural security measures are selected from the IT Baseline Protection Catalogs. The Federal Office for Security in Information Technology's IT Baseline Protection Catalogs offer a "cookbook recipe" for a normal level of protection. Besides probability of occurrence and potential damage extents, implementation costs are also considered. By using the Baseline Protection Catalogs, costly security analyses requiring expert knowledge are dispensed with, since overall hazards are worked with in the beginning. It is possible for the relative layman to identify measures to be taken and to implement them in cooperation with professionals. The BSI grants a baseline protection certificate as confirmation for the successful implementation of baseline protection. In stages 1 and 2, this is based on self declaration. In stage 3, an independent, BSI-licensed auditor completes an audit. Certification process internationalization has been possible since 2006. ISO/IEC 27001 certification can occur simultaneously with IT baseline protection certification. (The ISO/IEC 27001 standard is the successor of BS 7799-2). This process is based on the new BSI security standards. This process carries a development price which has prevailed for some time. Corporations having themselves certified under the BS 7799-2 standard are obliged to carry out a risk assessment. To make it more comfortable, most deviate from the protection needs analysis pursuant to the IT Baseline Protection Catalogs. The advantage is not only conformity with the strict BSI, but also attainment of BS 7799-2 certification. Beyond this, the BSI offers a few help aids like the policy template and the GSTOOL. One data protection component is available, which was produced in cooperation with the German Federal Commissioner for Data Protection and Freedom of Information and the state data protection authorities and integrated into the IT Baseline Protection Catalog. This component is not considered, however, in the certification process. == Baseline protection process == The following steps are taken pursuant to the baseline protection process during structure analysis and protection needs analysis: The IT network is defined. IT structure analysis is carried out. Protection needs determination is carried out. A baseline security check is carried out. IT baseline protection measures are implemented. Creation occurs in the following steps: IT structure analysis (survey) Assessment of protection needs Selection of actions Running comparison of nominal and actual. === IT structure analysis === An IT network includes the totality of infrastructural, organizational, personnel, and technical components serving the fulfillment of a task in a particular information processing application area. An IT network can thereby encompass the entire IT character of an institution or individual division, which is partitioned by organizational structures as, for example, a departmental network, or as shared IT applications, for example, a personnel information system. It is necessary to analyze and document the information technological structure in question to generate an IT security concept and especially to apply the IT Baseline Protection Catalogs. Due to today's usually heavily networked IT systems, a network topology plan offers a starting point for the analysis. The following aspects must be taken into consideration: The available infrastructure, The organizational and personnel framework for the IT network, Networked and non-networked IT systems employed in the IT network. The communications connections between IT systems and externally, IT applications run within the IT network. === Protection needs determination === The purpose of the protection needs determination is to investigate what protection is sufficient and appropriate for the information and information technology in use. In this connection, the damage to each application and the processed information, which could result from a breach of confidentiality, integrity or availability, is considered. Important in this context is a realistic assessment of the possible follow-on damages. A division into the three protection needs categories "low to medium", "high" and "very high" has proved itself of value. "Public", "internal" and "secret" are often used for confidentiality. === Modelling === Heavily networked IT systems typically characterize information technology in government and business these days. As a rule, therefore, it is advantageous to consider the entire IT system and not just individual systems within the scope of an IT security analysis and concept. To be able to manage this task, it makes sense to logically partition the entire IT system into parts and to separately consider each part or even an IT network. Detailed documentation about its structure is prerequisite for the use of the IT Baseline Protection Catalogs on an IT network. This can be achieved, for example, via the IT structure analysis described above. The IT Baseline Protection Catalog’s' components must ultimately be mapped onto the components of the IT network in question in a modelling step. === Baseline security check === The baseline security check is an organisational instrument offering a quick overview of the prevailing IT security level. With the help of interviews, the status quo of an existing IT network (as modelled by IT baseline protection) relative to the number of security measures implemented from the IT Baseline Protection Catalogs are investigated. The result is a catalog in which the implementation status "dispensable", "yes", "partly", or "no" is entered for each relevant measure. By identifying not yet, or only partially, implemented measures, improvement options for the security of the information technology in question are highlighted. The baseline security check gives information about measures, which are still missing (nominal vs. actual comparison). From this follows what remains to be done to achieve baseline protection through security. Not all measures suggested by this baseline check need to be implemented. Peculiarities are to be taken into account! It could be that several more or less unimportant applications are running on a server, which have lesser protection needs. In their totality, however, these applications are to be provided with a higher level of protection. This is called the (cumulation effect). The applications running on a server determine its need for protection. Several IT applications can run on an IT system. When this occurs, the application with the greatest need for protection determines the IT system’s protection category. Conversely, it is conceivable that an IT application with great protection needs does not automatically transfer this to the IT system. This may happen because the IT system is configured redundantly, or because only an inconsequential part is running on it. This is called the (distribution effect). This is the case, fo

Quotient automaton

In computer science, in particular in formal language theory, a quotient automaton can be obtained from a given nondeterministic finite automaton by joining some of its states. The quotient recognizes a superset of the given automaton; in some cases, handled by the Myhill–Nerode theorem, both languages are equal. == Formal definition == A (nondeterministic) finite automaton is a quintuple A = ⟨Σ, S, s0, δ, Sf⟩, where: Σ is the input alphabet (a finite, non-empty set of symbols), S is a finite, non-empty set of states, s0 is the initial state, an element of S, δ is the state-transition relation: δ ⊆ S × Σ × S, and Sf is the set of final states, a (possibly empty) subset of S. A string a1...an ∈ Σ is recognized by A if there exist states s1, ..., sn ∈ S such that ⟨si-1,ai,si⟩ ∈ δ for i=1,...,n, and sn ∈ Sf. The set of all strings recognized by A is called the language recognized by A; it is denoted as L(A). For an equivalence relation ≈ on the set S of A’s states, the quotient automaton A/≈ = ⟨Σ, S/≈, [s0], δ/≈, Sf/≈⟩ is defined by the input alphabet Σ being the same as that of A, the state set S/≈ being the set of all equivalence classes of states from S, the start state [s0] being the equivalence class of A’s start state, the state-transition relation δ/≈ being defined by δ/≈([s],a,[t]) if δ(s,a,t) for some s ∈ [s] and t ∈ [t], and the set of final states Sf/≈ being the set of all equivalence classes of final states from Sf. The process of computing A/≈ is also called factoring A by ≈. == Example == For example, the automaton A shown in the first row of the table is formally defined by ΣA = {0,1}, SA = {a,b,c,d}, sA0 = a, δA = { ⟨a,1,b⟩, ⟨b,0,c⟩, ⟨c,0,d⟩ }, and SAf = { b,c,d }. It recognizes the finite set of strings { 1, 10, 100 }; this set can also be denoted by the regular expression "1+10+100". The relation (≈) = { ⟨a,a⟩, ⟨a,b⟩, ⟨b,a⟩, ⟨b,b⟩, ⟨c,c⟩, ⟨c,d⟩, ⟨d,c⟩, ⟨d,d⟩ }, more briefly denoted as a≈b,c≈d, is an equivalence relation on the set {a,b,c,d} of automaton A’s states. Building the quotient of A by that relation results in automaton C in the third table row; it is formally defined by ΣC = {0,1}, SC = {a,c}, sC0 = a, δC = { ⟨a,1,a⟩, ⟨a,0,c⟩, ⟨c,0,c⟩ }, and SCf = { a,c }. It recognizes the finite set of all strings composed of arbitrarily many 1s, followed by arbitrarily many 0s, i.e. { ε, 1, 10, 100, 1000, ..., 11, 110, 1100, 11000, ..., 111, ... }; this set can also be denoted by the regular expression "10". Informally, C can be thought of resulting from A by glueing state a onto state b, and glueing state c onto state d. The table shows some more quotient relations, such as B = A/a≈b, and D = C/a≈c. == Properties == For every automaton A and every equivalence relation ≈ on its states set, L(A/≈) is a superset of (or equal to) L(A). Given a finite automaton A over some alphabet Σ, an equivalence relation ≈ can be defined on Σ by x ≈ y if ∀ z ∈ Σ: xz ∈ L(A) ↔ yz ∈ L(A). By the Myhill–Nerode theorem, A/≈ is a deterministic automaton that recognizes the same language as A. As a consequence, the quotient of A by every refinement of ≈ also recognizes the same language as A.