Considering Machine Translation (MT) as an Aid or a Threat to the Human Translator: The Case of Google Translate

The present study aims to evaluate the output quality of an online MT, namely Google Translate, from English into Persian and to compare its output with translations made by human translators in order to find out whether MT applications should be considered an aid or a threat to human translators. For the purposes of the study, the researcher designed a translation test consisting of 60 statements drawn from the text types proposed by Reiss (1989). The translation test was translated via Google Translate and also administered to three human translators to be rendered. The translations made by Google Translate and by the three human translators, alongside the 60 statements, were given to 40 judges to be evaluated on the basis of Dorr et al.'s (2010) criteria of MT quality assessment, namely semantic adequacy, fluency, and understandability. As the results indicated, Google Translate gave a good overall performance in the translation of the 60 statements into Persian in terms of semantic adequacy and understandability, but not in terms of fluency. Thus, there should be no fear of human translators being replaced by MT. In conclusion, MT applications cannot be considered a threat to human translators, but rather an aid for them. The present study also offers some recommendations that can be beneficial to translation students, trainee translators, translation teachers, and professional translators.


Introduction
Using conventional methods, such as paper dictionaries and typewriters, causes translators not only to spend more time and energy but also to produce costly translations. For example, translators had to use typewriters or record their own voices so that typists could transcribe them later (Kay, as cited in Abdi, 2019). Granell (2015) states that to produce high-quality translations in ever-shorter time, translators need to take quality and time requirements into consideration. In other words, traditional tools, as Kay infers, are in need of a fundamental change before they can serve in the translator's workstation.
The emergence of technology has brought about major changes in the translation industry, to the extent that almost all traditional tools have been replaced by modern ones. This has led to a shift in translators' attitudes towards the use of new translation methods to produce high-quality translations in a short time. The development of many types of translation tools was triggered by "the demand for a variety of translations by different groups of end-users" (Quah, 2006, p. 1). Machine Translation (MT) is "a sub-discipline of computational linguistics or, one could say, one of the latter's flagship application areas" (Nirenburg, 2003, p. 3).
The invention of MT caused great fear among most translators due to the assumption that the aim of creating MT, as Abdi (2019) states, was to replace the human translator. In this context, Granell (2015) implies that the use of MT instead of the human translator has always been anticipated, and this anticipation will continue. Hutchins and Somers (1992) argue that such fear was built on "a belief that computerization necessarily implies a loss of humanity" (p. 149). Hunt (2002) does not regard MT as a threat to the human translator because he believes that "computers will never replace translators, but translators who use computers will replace translators who don't" (p. 49).
For some time, access to MT came at a high cost for most translators. Later, online MTs, such as Google Translate, appeared and were freely available to translators. These online tools were welcomed by translators because of the opportunity they provided to save time and reduce translation costs. The important question that has always been raised is whether they are able to produce high-quality translations. A high-quality translation is well defined by Koby et al. (2014, p. 416) as follows: "A quality translation demonstrates accuracy and fluency required for the audience and purpose and complies with all other specifications negotiated between the requester and provider, taking into account end-user needs."
As is clear from the above definition, accuracy and fluency are two key paradigms that should be taken into consideration. Koby et al. (2014) argue that these two paradigms are kinds of quality that can be used to evaluate whether a translation fulfills its purpose. Evaluating the output quality of MT is expensive and time-consuming because many aspects of translation, such as adequacy, fidelity, and fluency, need to be evaluated (Hovy, 1999). Despite the existence of different types of automatic evaluation metrics, human evaluation, as Han (2018) discusses, is "usually trusted as the golden standards." Popovic et al. (2013) state that human translators are "the key to evaluating MT quality and also to addressing the so far unanswered question when and how to use MT in professional translation workflows" (p. 231).

The Present Study
Dorr et al. (2010) consider three important paradigms for evaluating the output quality of MT, namely semantic adequacy, fluency, and understandability, of which semantic adequacy is the most important, one that "is widely regarded as the Gold Standard for assessment of MT quality" (p. 811). In the light of these criteria of MT quality assessment, the aim of the present study was to evaluate the output quality of an online MT, in this case Google Translate, from English into Persian. To achieve this objective, the researcher attempted to answer the question of whether Google Translate should be considered an aid or a threat to human translators.

Google Translate
According to Grajales (2015), Google Translate is "a well-known app from Google, which works as a multi-language functional translator." Lotz and Van Rensburg (2014) describe Google Translate as "a free online application, offered by Google Inc. that allows users to have words, sentences, documents and even websites translated in an instant" (p. 237). They explain that Google Translate encompasses computer systems that produce translations based on patterns found in large amounts of text, not on a series of rules for a particular language. In other words, a large amount of text has been stored by Google Translate to be retrieved when translators are in need.
Wu et al. (2016) argue that Google Translate has changed its approach from Statistics-Based Machine Translation (SBMT) to Neural Machine Translation (NMT) to bridge the gap between human and machine translation and to deal with problems MT faces, such as rare words. In this context, Franz Josef Och (cited in Helft, 2010), a German computer scientist, implies that this technology can remove the language barrier and provide the opportunity for anyone to communicate with anyone else.

Human Judgments for Manual Assessment of MT Quality
Human judgments are an effective means of improving MT systems. In this regard, Coughlin (2001) mentions that "human evaluation has been the only means of providing the necessary feedback to keep development moving forward" (p. 63). Graham et al. (2013) argue that human evaluation for manual assessment of MT quality is "a key element in the development of machine translation systems" (p. 16). Evaluation of MT involving human judgments is, however, "an extremely demanding task" (Brkic et al., 2013, p. 1). Furthermore, such evaluation, as Dorr et al. (2010) argue, raises two main concerns that must be taken into consideration.
The first concern, as Dorr et al. (2010) discuss, is that humans may give different answers and express different opinions for the same evaluation. Thus, a panel of independent judges is needed so that those differences can be averaged. The second concern, as Dorr et al. imply, relates to the judges' familiarity with the subject matter and/or sub-languages, which can affect evaluation. Despite being time-consuming and costly, and despite such concerns, human evaluation remains "a gold standard for evaluation of MT quality" (Brkic et al., 2013, p. 1) and is used as "a baseline by which evaluation metrics are frequently judged"; therefore, no other evaluation metric can replace human judgments of translation (Dorr et al., 2010, p. 808).

Dorr et al.'s Criteria of MT Quality Assessment
To assess the output quality of MT, many paradigms need to be observed, such as semantic adequacy, fluency, and understandability (Dorr et al., 2010). According to Callison-Burch (2007), the first two are the paradigms most commonly used in human evaluation.
The most important paradigm, as Dorr et al. (2010) imply, is semantic adequacy (or fidelity), which seeks to answer the question: does the translation have the same meaning as the source-language material? Semantic adequacy evaluation of MT output is independent of fluency judgment and ignores it, because adequacy evaluation "measures whether the essential information in the source can be extracted from the system output" (p. 808). Moreover, evaluating semantic adequacy, as they state, is more challenging than evaluating fluency because the evaluator needs to be bilingual in order to judge whether information has been correctly transferred from the source language (SL) to the target language (TL).
Fluency is also considered an important paradigm: the translation should read like something a native speaker of the TL would say or write, alongside being understandable (Dorr et al., 2010). They note that judging fluency does not require the SL input, so "ratings for fluency are usually obtained from monolingual judges" (p. 808). Thus, the evaluator, as they argue, needs to be a fluent speaker of the TL to judge whether a translation is fluent, without taking the accuracy of the translation into account. To judge fluency and adequacy, a five- or seven-point scale should be prepared, and the two should be measured separately on each sentence in the system output (Przybocki, cited in Dorr et al., 2010). In this context, Dorr et al. (2010) imply that human judges are typically provided with a multi-point scale, but they are sometimes asked to make their judgments on some sort of numeric scale that lends itself to correlation/regression analysis.
Understandability is another paradigm that should be taken into consideration. The meaning of a translation may be completely and correctly conveyed yet be so awkward as to be difficult to understand (Dorr et al., 2010). For example, a literal translation of an SL idiom may reproduce its words correctly while leaving its meaning hard for target readers to grasp. To judge understandability, according to Dorr et al. (2010), a yes/no response from the judges is sufficient.
To judge the above paradigms, judges can be monolingual or bilingual. If they are monolingual, they need to be given one or more high-quality reference human translations; if they are bilingual, they should be provided with the SL material.

Participants
The participants of the present study were 40 judges who were invited to evaluate the output quality of Google Translate on the basis of a five-point scale prepared to examine the three paradigms proposed by Dorr et al. (2010). As the official website of the Iranian Association of Certified Translators and Interpreters (IACTI) was available on the Internet (https://www.iacti.ir), the researcher decided to choose the judges from among certified translators. The logic behind this selection was that they had already been qualified by the competent authorities, which facilitated the process of selecting the judges/participants for the present study. Moreover, such certified translators helped the researcher fulfil the objectives of the present study and achieve the intended results. Before judging, they were sufficiently informed about the subject matter so that valid judgments could be obtained.

Instruments
A translation test consisting of 60 statements was prepared for data collection. The statements were extracted from the text types proposed by Reiss (1989). Thus, all statements were divided into three categories: informative, expressive, and operative texts. Each category was organized into four sub-categories: simple, complex, compound, and complex-compound statements, with five statements of each type chosen for each category. This made it possible to examine the performance of Google Translate in the translation not only of different text types but also of different statement types, and to identify the weaknesses and strengths of Google Translate. All statements were extracted from online websites, such as news websites and online dictionaries. To validate the translation test, a panel of university professors with teaching experience in Translation Studies was asked to determine the content validity of the statements. The feedback received from these translation teachers was constructive and helped to improve the translation test in terms of rewording and reordering the statements. The test-retest method was applied to measure the reliability of the test. It was given to 10 certified human translators and, after two weeks, administered to the same translators again. The results of the two trials were correlated, and the coefficient of correlation indicated that the test was reliable (r = .802).
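The test-retest reliability check described above can be sketched in a few lines of code. This is a minimal illustration only: the two score lists are hypothetical stand-ins for the 10 translators' scores on the two administrations, not the study's actual data. The Pearson correlation between the two trials yields the reliability coefficient r.

```python
# Minimal sketch of a test-retest reliability check.
# trial_1 and trial_2 hold hypothetical overall test scores for the
# same 10 translators, two weeks apart; they are NOT the study's data.
from scipy.stats import pearsonr

trial_1 = [72, 65, 80, 58, 90, 77, 63, 85, 70, 68]
trial_2 = [70, 68, 82, 55, 88, 79, 60, 83, 73, 66]

# Pearson correlation between the two administrations = reliability r
r, p = pearsonr(trial_1, trial_2)
print(f"test-retest reliability: r = {r:.3f} (p = {p:.4f})")
```

A coefficient in the region of the study's reported r = .802 or higher would indicate acceptable stability of the test across the two administrations.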

Procedure
The following steps were established for data collection. First, the translation test was translated via Google Translate. The translations, alongside the 60 statements, were given to the judges to evaluate the output quality of Google Translate based on Dorr et al.'s (2010) criteria of MT quality assessment. A five-point scale on a continuum from "None" to "All" was prepared for evaluating semantic adequacy, showing how much of the meaning was correctly expressed in the translation. A five-point scale on a continuum from "Incomprehensible" to "Flawless Persian" was also designed to find out the extent to which the translation was fluent. The understandability of the output of Google Translate was judged on a "Yes/No" basis. To analyze the data, the frequencies and percentages of each point on the continuum chosen by the judges were calculated and presented in tables. Inferential statistics, namely the one-sample Wilcoxon signed-rank test, were employed to test the hypothesis.
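The one-sample Wilcoxon signed-rank test used in the analysis can be sketched as follows. SciPy only ships the paired form of the test, so the usual approach is to subtract the theoretical median (2.5, as in the study) from the ratings and test the resulting differences against zero. The ratings below are hypothetical five-point-scale judgments for illustration, not the study's data.

```python
# One-sample Wilcoxon signed-rank test against a theoretical median.
# 'ratings' simulates judges' 5-point-scale scores for one statement
# type; the values are illustrative only.
import numpy as np
from scipy.stats import wilcoxon

ratings = np.array([4, 3, 5, 4, 2, 4, 3, 5, 4, 3, 4, 5])
theoretical_median = 2.5

# Testing (ratings - 2.5) against zero is equivalent to a one-sample
# test of the ratings against the theoretical median.
stat, p = wilcoxon(ratings - theoretical_median)
print(f"W = {stat}, p = {p:.4f}, mean score = {ratings.mean():.2f}")
```

A p value below .05 would indicate that the judges' ratings differ significantly from the mid-point of the scale, as the study reports for the informative statements.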

Semantic Adequacy
A one-sample Wilcoxon signed-rank test was run to test the hypothesis. In other words, the test allowed the researcher to see whether there was a significant relationship between the judges' agreement and the semantic adequacy of the Persian translations of each statement type produced by Google Translate. The results, alongside the mean scores of the expressive and operative statements, illustrate that the judges' agreement with the Persian translations of these two types is not significant, indicating average semantic adequacy (ME = 3.12; MO = 3.07). The mean score of the informative statements was higher than the mid-point of the answers (i.e. 2.5), considering the theoretical mean/median of the population. The hypothesis was also rejected because the p value was lower than .05 (p < .05). This implies that the judges' agreement with the semantic adequacy of the informative statements is significant and that Google Translate mostly produced adequate informative translations (MI = 3.85).
In general, as Table 1 indicates, the judges agreed with the semantic adequacy of about two-thirds of the 60 statements (68%), whereas they disagreed with the semantic adequacy of 31% of the statements. The one-sample Wilcoxon signed-rank test was also employed to test whether the judges' agreement with the degree of semantic adequacy of the 60 Persian translations produced by Google Translate was significant. To do so, the sig value of this test was calculated. According to Table 3, the p value of the test was .0, which is lower than .05 (p < .05). Thus, the hypothesis was rejected. Furthermore, the mean score of the total answers to the 60 statements was 3.34 out of a maximum of 5, higher than the mid-point of the answers (3.34 > 2.5) (see Table 1). In other words, the judges in general significantly agreed with the semantic adequacy of the Persian translations of the 60 statements made by Google Translate (MT = 3.34).
Note. N = total number of statements; MDN = median. The sig value of the one-sample Wilcoxon signed-rank test is significant at p < .05.

Fluency
The judges did not agree with the fluency of 61% of the informative statements: 31% of the statements were considered incomprehensible or disfluent, and 30% non-native Persian. For the judges, only 39% of the sentences were fluent (4% flawless Persian and 35% good Persian) (see Table 4). According to the answer distribution, the Persian translations of the expressive statements fell into the following categories: incomprehensible (37%), disfluent Persian (24%), non-native Persian (21%), good Persian (15%), and flawless Persian (3%). In the judges' opinion, a great majority of the operative translations (73%) were incomprehensible (37%), disfluent (26%), or non-native Persian (10%). By contrast, a small number of the operative Persian translations (27%) were produced in a fluent manner (9% flawless Persian and 18% good Persian). Table 4 demonstrates the judges' agreement with the disfluency of two-thirds of the 60 Persian translations (28% incomprehensible, 24% disfluent, and 20% non-native Persian). The judges scored 22% as good Persian and 5% as flawless Persian.
A one-sample Wilcoxon signed-rank test was run to see the extent to which the judges expressed agreement with the fluency of each statement type and whether there was a significant relationship between the judges' agreement and the fluency of the Persian translations of each statement type. The results of the test indicate that the p values for the expressive and operative Persian translations were .07 and 1.0, higher than .05 (p > .05) (see Table 5). Thus, the hypothesis was not rejected. This indicates that there is no significant relationship between the judges' opinions and the fluency of the Persian translations of these two statement types. Moreover, the mean scores of the answers given to these translations were lower than the theoretical mean/median (i.e. 2.5).
This shows the disfluency of the expressive translations and the incomprehensibility of the operative translations that Google Translate produced (pE = .07, pO = 1.0). In contrast, the results show that the p value for the informative statements was lower than .05 (p < .05). Hence, the hypothesis was rejected, and the judges' opinion was significantly different from the theoretical mean/median. According to Table 4, the mean score of this sentence type was higher than the theoretical mean/median (i.e. 2.5), indicating average fluency of the Persian translations made by Google Translate (MI = 3.02).
Note. N = total number of statements; MDN = median. The sig value of the one-sample Wilcoxon signed-rank test is significant at p < .05.
Considering the total answers to the 60 statements, a one-sample Wilcoxon signed-rank test was also applied to test the hypothesis that there was a significant relationship between the judges' agreement and the fluency of the 60 Persian statements made by Google Translate. Table 6 indicates that the p value was higher than .05 (p > .05). As a result, the hypothesis was retained. In addition, the total mean score of the given answers was 2.53, almost equal to the theoretical mean/median (i.e. 2.5). Thus, the judges' agreement was not significant and did not differ from the theoretical mean. This indicates an average fluency of the Persian translations that Google Translate produced (MT = 2.53).
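The frequency/percentage tabulation behind the fluency figures reported above can be sketched with a short script. The category labels follow the paper's five-point fluency continuum; the list of judgments is a hypothetical sample, not the study's data.

```python
# Tabulating frequencies and percentages of fluency judgments.
# 'judgments' is a hypothetical sample of judges' choices on the
# paper's five-point fluency continuum.
from collections import Counter

scale = ["Incomprehensible", "Disfluent Persian", "Non-native Persian",
         "Good Persian", "Flawless Persian"]
judgments = ["Incomprehensible", "Good Persian", "Disfluent Persian",
             "Good Persian", "Non-native Persian", "Incomprehensible",
             "Flawless Persian", "Good Persian", "Disfluent Persian",
             "Non-native Persian"]

counts = Counter(judgments)
for label in scale:
    pct = 100 * counts[label] / len(judgments)
    print(f"{label:<20} {counts[label]:>3}  {pct:5.1f}%")
```

The same tabulation, applied per statement type, yields the percentage distributions shown in the fluency tables.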

Understandability
As Table 7 shows, more than half of the statements (55%) were understandable, whereas 44% of them were not easy to understand from the judges' viewpoint. Table 8 illustrates the results of the one-sample Wilcoxon signed-rank test, which was used to test whether the relationship between the judges' agreement and the understandability of the 60 Persian statements was significant. According to the table, the p value of the test was higher than .05 (p > .05). Hence, the hypothesis was not rejected, and the judges' opinion was not significantly different from the theoretical mean/median (i.e. 1.5). The mean score of the given answers was lower than (almost equal to) the theoretical mean/median, showing average understandability (MT = 1.44).

Discussion
In the previous section, the output quality of Google Translate was analyzed in terms of semantic adequacy, fluency, and understandability. This section discusses the results derived from the analysis and reports the findings of some recent studies in the field.
According to the results, Google Translate performed well in the translation of about two-thirds of the 60 statements into Persian. In other words, the highest scores given to the output of Google Translate were Almost and All (see Examples 1 and 2), meaning that the meaning of the ST was completely conveyed to the TT. Even for the few translations that received the lowest scores, fragments of the ST meaning still appeared in the TT (see Example 3). Zakaryia (2020) reports similar findings in his doctoral dissertation, derived from a comparison of the output quality of three MTs, one of which was Google Translate, in 2016 and 2017. According to his results, Google Translate received high adequacy scores in both years, and its performance was better in 2017 than in 2016.
Understandability was the last paradigm that the judges were asked to evaluate. Dorr et al. (2010) imply that semantic adequacy is the most important paradigm to be judged, but that this quality itself "is measured in an interestingly indirect way that encompasses the understandability of the translation" (p. 814). That is to say, there is an indirect relation between semantic adequacy evaluation and understandability evaluation. Based on the results, the translations made by Google Translate were informative and easy to understand to some degree (see Example 7). This indicates not only the average clarity of the Google Translate output but also the quality of the semantic adequacy of the Persian translations. The main shortcoming of Google Translate was in the translation of statements containing implied meanings, such as operative texts and idioms. In such cases, the translations made by Google Translate needed post-editing because they neither fully conveyed the meaning nor were fluent and comprehensible.
Furthermore, Google Translate produces translations based on patterns in large amounts of stored text rather than on a genuine understanding of the source material, and this affects the output quality of such a popular online MT.

Conclusion
Notwithstanding the automatic evaluation of MT output quality, which is considered a cost-effective method, human evaluation has some advantages, one of which is the capability of human evaluators to "perform some tasks that are currently beyond the reach of automated metrics," such as monitoring the quality of individual translated sentences with high accuracy (Coughlin, 2013, p. 69). Hence, the present study attempted to evaluate the output quality of Google Translate to see whether MTs should be considered an aid or a threat to human translators. As the results indicate, the translations Google Translate produced were acceptable to a certain degree. In a wider sense, the performance of Google Translate was partly satisfactory, but not to the extent that it can be used alone; its output needs to be edited by a human translator to reach high quality. To support this, the researcher relies on the report of the Automatic Language Processing Advisory Committee (1966), where it is clearly stated that "it was not possible to obtain a translation that was entirely carried out by a computer and of human quality" (cited in Delpech, 2014, p. 4). Post-editing, as Hutchins (1996) implies, is the only way to reach a good quality of translation. In conclusion, the assumption that MT would replace human translators is rejected, and MTs are considered an aid, not a threat, to human translators.
The present study offers some pedagogical implications that can be helpful to translation students, trainee translators, translation teachers, and professional translators.
Translation students and trainee translators are recommended to improve their skill in using not only MTs but also other technological tools appropriately. They are advised not to rely on Google Translate alone to produce high-quality translations; to reach good-quality translations, they need to apply their own skills alongside Google Translate. The latter helps to reduce the cost and time of translation, while the former gives the translator the opportunity to enhance translation quality.
Teaching new technologies is a task for which translation teachers are responsible. They need to pay serious attention to translation practice alongside translation theory. To do this, translation teachers can encourage students to attend seminars and workshops held by experts in the field so as to become familiar with new technologies and broaden their knowledge of such tools. By designing new materials encompassing modern technological tools, as well as assigning translation tasks that involve students with Google Translate and other types of MT, translation teachers can teach students to employ MTs in their translations instead of using traditional methods.
The role professional translators play is of great importance to the improvement of Google Translate. Thus, they are recommended to contribute to enhancing the output quality of this popular MT by validating translations. That is, professional translators can select the correct translation from those offered by Google Translate for individual statements, especially those with implied meanings, such as proverbs and idioms. This helps not only to improve the performance of Google Translate but also to make it possible for people around the world to understand one another's languages a little better.

Conflicts of Interest:
The authors declare no conflict of interest.
Funding sponsors: The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.