Coverage Report

Coverage Report - org.apache.commons.codec.language.bm.BeiderMorseEncoder

Classes in this File

Line Coverage

Branch Coverage

Complexity

BeiderMorseEncoder

100%

19/19

100%

4/4

1.444

 /*
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements.  See the NOTICE file distributed with
  * this work for additional information regarding copyright ownership.
  * The ASF licenses this file to You under the Apache License, Version 2.0
  * (the "License"); you may not use this file except in compliance with
  * the License.  You may obtain a copy of the License at
  *
  *      http://www.apache.org/licenses/LICENSE-2.0
  *
  * Unless required by applicable law or agreed to in writing, software
  * distributed under the License is distributed on an "AS IS" BASIS,
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
 
 package org.apache.commons.codec.language.bm;
 
 import org.apache.commons.codec.EncoderException;
 import org.apache.commons.codec.StringEncoder;
 
 /**
  * Encodes strings into their Beider-Morse phonetic encoding.
  * <p>
  * Beider-Morse phonetic encodings are optimised for family names. However, they may be useful for a wide range
  * of words.
  * <p>
  * This encoder is intentionally mutable to allow dynamic configuration through bean properties. As such, it
  * is mutable, and may not be thread-safe. If you require a guaranteed thread-safe encoding then use
  * {@link PhoneticEngine} directly.
  * <p>
  * <b>Encoding overview</b>
  * <p>
  * Beider-Morse phonetic encodings is a multi-step process. Firstly, a table of rules is consulted to guess what
  * language the word comes from. For example, if it ends in "<code>ault</code>" then it infers that the word is French.
  * Next, the word is translated into a phonetic representation using a language-specific phonetics table. Some
  * runs of letters can be pronounced in multiple ways, and a single run of letters may be potentially broken up
  * into phonemes at different places, so this stage results in a set of possible language-specific phonetic
  * representations. Lastly, this language-specific phonetic representation is processed by a table of rules that
  * re-writes it phonetically taking into account systematic pronunciation differences between languages, to move
  * it towards a pan-indo-european phonetic representation. Again, sometimes there are multiple ways this could be
  * done and sometimes things that can be pronounced in several ways in the source language have only one way to
  * represent them in this average phonetic language, so the result is again a set of phonetic spellings.
  * <p>
  * Some names are treated as having multiple parts. This can be due to two things. Firstly, they may be hyphenated.
  * In this case, each individual hyphenated word is encoded, and then these are combined end-to-end for the final
  * encoding. Secondly, some names have standard prefixes, for example, "<code>Mac/Mc</code>" in Scottish (English)
  * names. As sometimes it is ambiguous whether the prefix is intended or is an accident of the spelling, the word
  * is encoded once with the prefix and once without it. The resulting encoding contains one and then the other result.
  * <p>
  * <b>Encoding format</b>
  * <p>
  * Individual phonetic spellings of an input word are represented in upper- and lower-case roman characters. Where
  * there are multiple possible phonetic representations, these are joined with a pipe (<code>|</code>) character.
  * If multiple hyphenated words where found, or if the word may contain a name prefix, each encoded word is placed
  * in elipses and these blocks are then joined with hyphens. For example, "<code>d'ortley</code>" has a possible
  * prefix. The form without prefix encodes to "<code>ortlaj|ortlej</code>", while the form with prefix encodes to
  * "<code>dortlaj|dortlej</code>". Thus, the full, combined encoding is "{@code (ortlaj|ortlej)-(dortlaj|dortlej)}".
  * <p>
  * The encoded forms are often quite a bit longer than the input strings. This is because a single input may have many
  * potential phonetic interpretations. For example, "<code>Renault</code>" encodes to
  * "<code>rYnDlt|rYnalt|rYnult|rinDlt|rinalt|rinult</code>". The <code>APPROX</code> rules will tend to produce larger
  * encodings as they consider a wider range of possible, approximate phonetic interpretations of the original word.
  * Down-stream applications may wish to further process the encoding for indexing or lookup purposes, for example, by
  * splitting on pipe (<code>|</code>) and indexing under each of these alternatives.
  *
  * @since 1.6
  * @version $Id$
  */
 public class BeiderMorseEncoder implements StringEncoder {
     // Implementation note: This class is a spring-friendly facade to PhoneticEngine. It allows read/write configuration
     // of an immutable PhoneticEngine instance that will be delegated to for the actual encoding.
 
     // a cached object
     private PhoneticEngine engine = new PhoneticEngine(NameType.GENERIC, RuleType.APPROX, true);
 
     @Override
     public Object encode(final Object source) throws EncoderException {
         if (!(source instanceof String)) {
             throw new EncoderException("BeiderMorseEncoder encode parameter is not of type String");
         }
         return encode((String) source);
     }
 
     @Override
     public String encode(final String source) throws EncoderException {
         if (source == null) {
             return null;
         }
         return this.engine.encode(source);
     }
 
     /**
      * Gets the name type currently in operation.
      *
      * @return the NameType currently being used
      */
     public NameType getNameType() {
         return this.engine.getNameType();
     }
 
     /**
      * Gets the rule type currently in operation.
      *
      * @return the RuleType currently being used
      */
     public RuleType getRuleType() {
         return this.engine.getRuleType();
     }
 
     /**
      * Discovers if multiple possible encodings are concatenated.
      *
      * @return true if multiple encodings are concatenated, false if just the first one is returned
      */
     public boolean isConcat() {
         return this.engine.isConcat();
     }
 
     /**
      * Sets how multiple possible phonetic encodings are combined.
      *
      * @param concat
      *            true if multiple encodings are to be combined with a '|', false if just the first one is
      *            to be considered
      */
     public void setConcat(final boolean concat) {
         this.engine = new PhoneticEngine(this.engine.getNameType(),
                                          this.engine.getRuleType(),
                                          concat,
                                          this.engine.getMaxPhonemes());
     }
 
     /**
      * Sets the type of name. Use {@link NameType#GENERIC} unless you specifically want phonetic encodings
      * optimized for Ashkenazi or Sephardic Jewish family names.
      *
      * @param nameType
      *            the NameType in use
      */
     public void setNameType(final NameType nameType) {
         this.engine = new PhoneticEngine(nameType,
                                          this.engine.getRuleType(),
                                          this.engine.isConcat(),
                                          this.engine.getMaxPhonemes());
     }
 
     /**
      * Sets the rule type to apply. This will widen or narrow the range of phonetic encodings considered.
      *
      * @param ruleType
      *            {@link RuleType#APPROX} or {@link RuleType#EXACT} for approximate or exact phonetic matches
      */
     public void setRuleType(final RuleType ruleType) {
         this.engine = new PhoneticEngine(this.engine.getNameType(),
                                          ruleType,
                                          this.engine.isConcat(),
                                          this.engine.getMaxPhonemes());
     }
 
     /**
      * Sets the number of maximum of phonemes that shall be considered by the engine.
      *
      * @param maxPhonemes
      *            the maximum number of phonemes returned by the engine
      * @since 1.7
      */
     public void setMaxPhonemes(final int maxPhonemes) {
         this.engine = new PhoneticEngine(this.engine.getNameType(),
                                          this.engine.getRuleType(),
                                          this.engine.isConcat(),
                                          maxPhonemes);
     }
 
 }

1		/*
2		* Licensed to the Apache Software Foundation (ASF) under one or more
3		* contributor license agreements. See the NOTICE file distributed with
4		* this work for additional information regarding copyright ownership.
5		* The ASF licenses this file to You under the Apache License, Version 2.0
6		* (the "License"); you may not use this file except in compliance with
7		* the License. You may obtain a copy of the License at
8		*
9		* http://www.apache.org/licenses/LICENSE-2.0
10		*
11		* Unless required by applicable law or agreed to in writing, software
12		* distributed under the License is distributed on an "AS IS" BASIS,
13		* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14		* See the License for the specific language governing permissions and
15		* limitations under the License.
16		*/
17
18		package org.apache.commons.codec.language.bm;
19
20		import org.apache.commons.codec.EncoderException;
21		import org.apache.commons.codec.StringEncoder;
22
23		/**
24		* Encodes strings into their Beider-Morse phonetic encoding.
25		* <p>
26		* Beider-Morse phonetic encodings are optimised for family names. However, they may be useful for a wide range
27		* of words.
28		* <p>
29		* This encoder is intentionally mutable to allow dynamic configuration through bean properties. As such, it
30		* is mutable, and may not be thread-safe. If you require a guaranteed thread-safe encoding then use
31		* {@link PhoneticEngine} directly.
32		* <p>
33		* <b>Encoding overview</b>
34		* <p>
35		* Beider-Morse phonetic encodings is a multi-step process. Firstly, a table of rules is consulted to guess what
36		* language the word comes from. For example, if it ends in "<code>ault</code>" then it infers that the word is French.
37		* Next, the word is translated into a phonetic representation using a language-specific phonetics table. Some
38		* runs of letters can be pronounced in multiple ways, and a single run of letters may be potentially broken up
39		* into phonemes at different places, so this stage results in a set of possible language-specific phonetic
40		* representations. Lastly, this language-specific phonetic representation is processed by a table of rules that
41		* re-writes it phonetically taking into account systematic pronunciation differences between languages, to move
42		* it towards a pan-indo-european phonetic representation. Again, sometimes there are multiple ways this could be
43		* done and sometimes things that can be pronounced in several ways in the source language have only one way to
44		* represent them in this average phonetic language, so the result is again a set of phonetic spellings.
45		* <p>
46		* Some names are treated as having multiple parts. This can be due to two things. Firstly, they may be hyphenated.
47		* In this case, each individual hyphenated word is encoded, and then these are combined end-to-end for the final
48		* encoding. Secondly, some names have standard prefixes, for example, "<code>Mac/Mc</code>" in Scottish (English)
49		* names. As sometimes it is ambiguous whether the prefix is intended or is an accident of the spelling, the word
50		* is encoded once with the prefix and once without it. The resulting encoding contains one and then the other result.
51		* <p>
52		* <b>Encoding format</b>
53		* <p>
54		* Individual phonetic spellings of an input word are represented in upper- and lower-case roman characters. Where
55		* there are multiple possible phonetic representations, these are joined with a pipe (<code>\|</code>) character.
56		* If multiple hyphenated words where found, or if the word may contain a name prefix, each encoded word is placed
57		* in elipses and these blocks are then joined with hyphens. For example, "<code>d'ortley</code>" has a possible
58		* prefix. The form without prefix encodes to "<code>ortlaj\|ortlej</code>", while the form with prefix encodes to
59		* "<code>dortlaj\|dortlej</code>". Thus, the full, combined encoding is "{@code (ortlaj\|ortlej)-(dortlaj\|dortlej)}".
60		* <p>
61		* The encoded forms are often quite a bit longer than the input strings. This is because a single input may have many
62		* potential phonetic interpretations. For example, "<code>Renault</code>" encodes to
63		* "<code>rYnDlt\|rYnalt\|rYnult\|rinDlt\|rinalt\|rinult</code>". The <code>APPROX</code> rules will tend to produce larger
64		* encodings as they consider a wider range of possible, approximate phonetic interpretations of the original word.
65		* Down-stream applications may wish to further process the encoding for indexing or lookup purposes, for example, by
66		* splitting on pipe (<code>\|</code>) and indexing under each of these alternatives.
67		*
68		* @since 1.6
69		* @version $Id$
70		*/
71	36	public class BeiderMorseEncoder implements StringEncoder {
72		// Implementation note: This class is a spring-friendly facade to PhoneticEngine. It allows read/write configuration
73		// of an immutable PhoneticEngine instance that will be delegated to for the actual encoding.
74
75		// a cached object
76	36	private PhoneticEngine engine = new PhoneticEngine(NameType.GENERIC, RuleType.APPROX, true);
77
78		@Override
79		public Object encode(final Object source) throws EncoderException {
80	97	if (!(source instanceof String)) {
81	1	throw new EncoderException("BeiderMorseEncoder encode parameter is not of type String");
82		}
83	96	return encode((String) source);
84		}
85
86		@Override
87		public String encode(final String source) throws EncoderException {
88	67090	if (source == null) {
89	1	return null;
90		}
91	67089	return this.engine.encode(source);
92		}
93
94		/**
95		* Gets the name type currently in operation.
96		*
97		* @return the NameType currently being used
98		*/
99		public NameType getNameType() {
100	1	return this.engine.getNameType();
101		}
102
103		/**
104		* Gets the rule type currently in operation.
105		*
106		* @return the RuleType currently being used
107		*/
108		public RuleType getRuleType() {
109	1	return this.engine.getRuleType();
110		}
111
112		/**
113		* Discovers if multiple possible encodings are concatenated.
114		*
115		* @return true if multiple encodings are concatenated, false if just the first one is returned
116		*/
117		public boolean isConcat() {
118	1	return this.engine.isConcat();
119		}
120
121		/**
122		* Sets how multiple possible phonetic encodings are combined.
123		*
124		* @param concat
125		* true if multiple encodings are to be combined with a '\|', false if just the first one is
126		* to be considered
127		*/
128		public void setConcat(final boolean concat) {
129	1	this.engine = new PhoneticEngine(this.engine.getNameType(),
130		this.engine.getRuleType(),
131		concat,
132		this.engine.getMaxPhonemes());
133	1	}
134
135		/**
136		* Sets the type of name. Use {@link NameType#GENERIC} unless you specifically want phonetic encodings
137		* optimized for Ashkenazi or Sephardic Jewish family names.
138		*
139		* @param nameType
140		* the NameType in use
141		*/
142		public void setNameType(final NameType nameType) {
143	11	this.engine = new PhoneticEngine(nameType,
144		this.engine.getRuleType(),
145		this.engine.isConcat(),
146		this.engine.getMaxPhonemes());
147	11	}
148
149		/**
150		* Sets the rule type to apply. This will widen or narrow the range of phonetic encodings considered.
151		*
152		* @param ruleType
153		* {@link RuleType#APPROX} or {@link RuleType#EXACT} for approximate or exact phonetic matches
154		*/
155		public void setRuleType(final RuleType ruleType) {
156	12	this.engine = new PhoneticEngine(this.engine.getNameType(),
157		ruleType,
158		this.engine.isConcat(),
159		this.engine.getMaxPhonemes());
160	11	}
161
162		/**
163		* Sets the number of maximum of phonemes that shall be considered by the engine.
164		*
165		* @param maxPhonemes
166		* the maximum number of phonemes returned by the engine
167		* @since 1.7
168		*/
169		public void setMaxPhonemes(final int maxPhonemes) {
170	1	this.engine = new PhoneticEngine(this.engine.getNameType(),
171		this.engine.getRuleType(),
172		this.engine.isConcat(),
173		maxPhonemes);
174	1	}
175
176		}